In 2026, the vast majority of AI chatbots deployed in enterprises rely on RAG (Retrieval-Augmented Generation). Yet in our audits, we see a huge gap between projects that work and those that languish, and the LLM is almost never the problem. The knowledge base is. The same GPT-4 Turbo or Mistral Large 2, plugged into two different RAG pipelines, can produce a brilliant assistant or a polite hallucination generator. In our experience, about 90% of the difference comes down to retrieval engineering.

This article is a technical guide to building a RAG that genuinely makes your chatbot useful. We detail six steps: source audit, chunking, embeddings, vector database, reranking, evaluation. Each step carries concrete trade-offs documented with numbers from our projects. Goal: give you the right questions to ask before signing a RAG budget, and the right signals to detect when an existing project is derailing.

Why RAG has become the decisive step in 2026

Frontier LLMs have improved massively, but they remain fundamentally ignorant of your business. A useful internal assistant must answer based on your documentation, your procedures, your contracts, your products. Fine-tuning remains heavy, costly and inflexible against sources that change every week. RAG won because it offers the right compromise: inject relevant context at query time, without touching the model, with near real-time updates possible.

But RAG is a complex multi-stage system, and each stage can degrade the whole. A poorly cleaned source, coarse chunking, embeddings unsuited to French, a misconfigured vector database or a missing reranker: any one of these is enough to make the final answer disappointing, even if the LLM is excellent. That's why we always treat RAG as its own engineering discipline, with its own measurement and continuous improvement loop.

1. Audit and clean sources

Everything starts with the corpus. A RAG chatbot invents nothing: it selects, rephrases and synthesizes what you give it. If internal documentation contains three contradictory versions of a procedure, the RAG will surface one of the three more or less at random. If a 2019 PDF sits forgotten on a SharePoint site, it will end up cited. Our systematic first step is a documentary audit: map sources, identify duplicates, mark obsolete content, spot contradictions.

Concretely, we work in batches: 200 to 500 documents reviewed by a business-tech pair, with a qualification grid (authoritative? up to date? coherent?). This work is thankless but decisive. Projects that skip it never catch up: they stack technical layers to compensate for a sick corpus.

  • Remove exact duplicates and near-duplicates
  • Mark obsolete content with an expiration date
  • Resolve contradictions with the business (explicit validation)
  • Normalize formats (clean Markdown over scanned PDFs)
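The first item in the list above can be partly automated before the human review. The following is a minimal sketch, assuming documents are already extracted to plain text: exact duplicates are caught with a hash of the normalized text, near-duplicates with Jaccard similarity over 3-word shingles. The function names, the 0.8 threshold and the shingle size are illustrative assumptions, not values from our projects, and the pairwise comparison only suits review batches of a few hundred documents.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, size: int = 3) -> set[str]:
    """Return the set of overlapping word n-grams (shingles) of a text."""
    words = normalize(text).split()
    if len(words) <= size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared shingles divided by all distinct shingles."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs: dict[str, str], threshold: float = 0.8):
    """Keep the first occurrence of each document.

    Returns (kept, dropped) where dropped lists
    (dropped_id, kept_id_it_duplicates, "exact" | "near").
    The threshold is an assumed starting point, to be tuned per corpus.
    """
    kept: dict[str, str] = {}
    dropped: list[tuple[str, str, str]] = []
    seen_hashes: dict[str, str] = {}
    kept_shingles: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            dropped.append((doc_id, seen_hashes[digest], "exact"))
            continue
        current = shingles(text)
        match = next(
            (k for k, s in kept_shingles.items() if jaccard(current, s) >= threshold),
            None,
        )
        if match is not None:
            dropped.append((doc_id, match, "near"))
            continue
        seen_hashes[digest] = doc_id
        kept_shingles[doc_id] = current
        kept[doc_id] = text
    return kept, dropped
```

The `dropped` list is meant as input for the business-tech pair, not as an automatic deletion order: a near-duplicate may be the more recent and authoritative version.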

2. Pick a chunking strategy

Chunking is the operation that splits each document into indexable fragments. Three parameters matter: size, overlap and strategy. Too small (under 200 tokens), each chunk loses its context and retrieval brings back incoherent fragments. Too large (over 1,500 tokens), chunks become fuzzy and embeddings average too many different ideas. Our default: 512 tokens with 64 to 128 tokens of overlap. But this value is never sacred: we calibrate it on the golden dataset (the business-validated question-answer set described in step 6).
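To make size and overlap concrete, here is a minimal sketch of fixed-size chunking with a sliding window. It assumes the document is already tokenized into a list; in a real pipeline the tokens should come from the tokenizer of the chosen embedding model, and the function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into windows of `size` tokens.

    Consecutive windows share `overlap` tokens, so a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once the window has reached the end of the document,
        # otherwise the tail would be emitted again as a tiny chunk.
        if start + size >= len(tokens):
            break
    return chunks
```

With the defaults, a 1,200-token document yields three chunks starting at tokens 0, 448 and 896; the last one is shorter than 512 tokens.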

Strategy matters as much as size. Fixed chunking (every N tokens) is fast to implement but crude on structured documents. Semantic chunking (based on sentence embeddings) better respects meaning units. Structure-aware chunking (which exploits HTML tags, Markdown titles, PDF structure) is almost always superior on technical documentation. For an e-commerce site, a product sheet should remain one chunk. For a technical manual, each H2 or H3 section forms a natural chunk.
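As an illustration of structure-aware chunking, the sketch below splits a Markdown document at headings and keeps the heading trail as a breadcrumb, so each chunk carries its own context. It is a simplified assumption of how such a splitter can look: it only recognizes `#`-style headings, ignores fenced code blocks that contain `#` lines, and the name `chunk_by_headings` is hypothetical.

```python
import re

# Matches "# Title" through "###### Title".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_headings(markdown: str, max_level: int = 3) -> list[dict]:
    """One chunk per section, split at headings up to `max_level` (H3 by default)."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # current heading trail as (level, title)
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            breadcrumb = " > ".join(title for _, title in path)
            chunks.append({"breadcrumb": breadcrumb, "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()  # close the section that was being read
            level = len(match.group(1))
            # Leave any section at the same depth or deeper.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)  # deeper headings stay inside the chunk
    flush()
    return chunks
```

Prepending the breadcrumb to the chunk text before embedding is a common design choice: a section titled "Linux" means little without "Manual > Install" in front of it. Sections that exceed the size budget can still be passed through a fixed-size splitter afterwards.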

3. Select embeddings

Embeddings turn each chunk into a vector. They're the RAG's search organ. The choice depends on three criteria: quality in your language (French remains poorly covered by English-only models), dimension (512 to 3,072 depending on the model, with a direct impact on storage cost and latency) and cost. Our four recurring candidates in 2026:

  • OpenAI text-embedding-3-large: 3,072 dimensions, excellent multilingual quality, ~€0.13 per million tokens
  • Cohere Embed v3: 1,024 dimensions, very good multilingual quality, supports typed inputs (queries and documents are embedded differently)
  • BGE-M3 (open-source): 1,024 dimensions, remarkable performance, self-hostable
  • Specialized French models such as Solon or CamemBERT-large, for predominantly French-language corpora

Our method: test at least two embedding models on the client's golden dataset. Measure recall@5 and recall@10. The gap is sometimes spectacular — we've seen BGE-M3 beat OpenAI on a very specific French legal corpus, and the opposite on a multilingual technical corpus. No model is universally better.
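Recall@k is simple enough to compute by hand, which helps when comparing two embedding models on the same questions. The sketch below assumes the golden dataset maps each question to the set of chunk IDs that should be retrieved, and that each model produces a ranked list of chunk IDs per question. The data structures and function names are illustrative assumptions.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top-k retrieved results."""
    if not relevant:
        raise ValueError("each golden question needs at least one relevant chunk")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    rankings: dict[str, list[str]],
    golden: dict[str, set[str]],
    k: int,
) -> float:
    """Average recall@k over every question of the golden dataset."""
    scores = [recall_at_k(rankings[question], relevant, k)
              for question, relevant in golden.items()]
    return sum(scores) / len(scores)
```

Running `mean_recall_at_k` with k=5 and k=10 for each candidate model, on identical chunks and questions, gives the side-by-side comparison described above. A large gap between recall@5 and recall@10 for the same model suggests the right chunks are found but ranked too low, which is the situation a reranker addresses.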

4. Pick the vector database

The vector database stores embeddings and runs similarity search (cosine, dot-product, L2). Five options dominate:

  • Pinecone (managed): unbeatable on time-to-market, with a few lines of code and automatic scaling, at the price of a recurring cost and dependency on a US provider
  • Qdrant: open-source, self-hostable, performant, written in Rust; our default choice when sovereignty matters
  • pgvector: if you already have Postgres, there is no need to add a service; perfect up to a few million vectors
  • Weaviate: for rich schemas and advanced hybrid search
  • Chroma: for quick local POCs
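The three similarity measures mentioned in this section (cosine, dot product, L2) can be written in a few lines, which makes it clearer what a vector database actually computes. This is a brute-force sketch in pure Python for illustration only: real vector stores use approximate indexes rather than scanning every vector, and the function names are hypothetical.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product: sensitive to both direction and vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the two vectors after scaling each to unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance: smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Exhaustive search: score every stored vector, keep the k most similar."""
    ranked = sorted(
        vectors,
        key=lambda chunk_id: cosine_similarity(query, vectors[chunk_id]),
        reverse=True,
    )
    return ranked[:k]
```

When vectors are normalized to unit length, cosine similarity equals the dot product and L2 distance produces the same ranking, so the metric choice mostly matters for models whose embeddings are not normalized.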

Indicative cost: €50 to €500 per month depending on volume and mode (managed vs self-hosted). Beyond 10 million vectors or under strong sovereignty constraints, self-hosting becomes almost mandatory. We advise against stacking tools: a single well-mastered vector store beats three poorly integrated ones.
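Volume translates into storage through simple arithmetic: number of vectors, times dimensions, times bytes per value. The sketch below assumes uncompressed 32-bit floats (4 bytes per value) and counts only the raw vectors; index structures, metadata and replicas come on top, and their overhead depends on the store. The function name is hypothetical.

```python
def raw_vector_storage_gb(n_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw size of the vectors alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_vectors * dimensions * bytes_per_value / 1e9


# 10 million chunks embedded at 1,024 dimensions: about 40.96 GB of raw vectors.
size_1024 = raw_vector_storage_gb(10_000_000, 1024)

# The same corpus at 3,072 dimensions: about 122.88 GB, three times as much.
size_3072 = raw_vector_storage_gb(10_000_000, 3072)
```

This is why embedding dimension and vector store sizing should be decided together: tripling the dimension triples the raw footprint for the same corpus.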

5. Add reranking

Reranking is the most underestimated step in RAG. Initial retrieval typically brings back a top-20 or top-50 of candidate chunks by vector similarity. But embedding similarity is a crude approximation of real relevance. A reranker — a heavier model, cross-encoder type — reorders these candidates by finely evaluating each (question, chunk) pair. On our projects, adding reranking brings 10 to 15% additional precision on the top-3, which is exactly what's injected into the final prompt.
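The two-stage pattern can be sketched as follows. The scoring function here is a deliberately naive word-overlap measure standing in for a cross-encoder: it is an assumption made so the example runs without any model, and in a real pipeline `score_pair` would call the reranker of your choice. All names are hypothetical.

```python
from typing import Callable


def lexical_overlap(question: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: share of question words found in the chunk."""
    question_words = set(question.lower().split())
    chunk_words = set(chunk.lower().split())
    if not question_words:
        return 0.0
    return len(question_words & chunk_words) / len(question_words)


def rerank(
    question: str,
    candidates: list[tuple[str, str]],
    score_pair: Callable[[str, str], float] = lexical_overlap,
    top_n: int = 3,
) -> list[str]:
    """Re-score each (question, chunk) pair and return the best `top_n` chunk IDs.

    `candidates` is the first-stage result, as (chunk_id, text) in retrieval order.
    """
    scored = [(score_pair(question, text), chunk_id) for chunk_id, text in candidates]
    # Python's sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_n]]
```

The structure is what matters: the first stage is cheap and wide (top-20 or top-50), the second stage is expensive and narrow, and only its top-3 reaches the prompt.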

Two options dominate. Cohere Rerank v3 is an excellent managed service, multilingual, around €1 per 1,000 requests. BGE-reranker is open-source, self-hostable, almost as performant, ideal in a sovereign context. The investment is minimal — a few hours of integration, a few dozen milliseconds of latency — for a quality gain far larger than switching LLMs.

6. Evaluate continuously

A RAG without evaluation is a black box drifting silently. Our standard: from scoping onward, build an internal golden dataset of 50 to 100 question-answer pairs validated by the business, and measure the key metrics with RAGAS at each iteration: faithfulness (does the answer stick to the sources?), answer relevance (does it answer the question?), context precision and context recall. Together, these four metrics cover the full pipeline and show precisely where a regression appears.
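Once the four scores exist for each iteration, locating a regression is a comparison between two runs. The sketch below does not call RAGAS itself: it assumes each run has already been reduced to a dictionary of scores between 0 and 1, keyed by the four metric names used in this article. The stage mapping, the 0.02 tolerance and the function name are illustrative assumptions.

```python
# Which part of the pipeline each metric mostly reflects (assumed mapping).
METRIC_STAGE = {
    "context_precision": "retrieval",
    "context_recall": "retrieval",
    "faithfulness": "generation",
    "answer_relevance": "generation",
}


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, tuple[str, float]]:
    """Return metrics that dropped by more than `tolerance`, with the stage to inspect."""
    regressions = {}
    for metric, stage in METRIC_STAGE.items():
        delta = candidate[metric] - baseline[metric]
        if delta < -tolerance:
            regressions[metric] = (stage, round(delta, 4))
    return regressions
```

A drop in a context metric points at chunking, embeddings or reranking; a drop in faithfulness or answer relevance with stable context metrics points at the prompt or the LLM. On a golden dataset of 50 to 100 questions, small deltas are noisy, which is the reason for the tolerance.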

On top of this, production evaluation: user feedback (thumbs up/down with comment), monitoring of unsatisfactorily answered questions, A/B testing between two configurations (for example chunking 512 vs 768, with or without reranker). LangChain and LlamaIndex now offer native instrumentation hooks for these measurements, and tools like Langfuse or Arize's Phoenix aggregate them cleanly.
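For the A/B comparison, thumbs up/down counts can be checked with a standard two-proportion z-test before declaring a winner. The sketch below uses only the standard library; it assumes each configuration served a separate, randomly assigned set of conversations, and the function name is hypothetical.

```python
import math


def two_proportion_z_test(up_a: int, n_a: int, up_b: int, n_b: int) -> tuple[float, float]:
    """Compare thumbs-up rates of configurations A and B.

    Returns (z, p_value) for a two-sided test; a positive z means B scores higher.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both configurations need at least one rated answer")
    rate_a, rate_b = up_a / n_a, up_b / n_b
    pooled = (up_a + up_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # all votes identical: no evidence of a difference
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the normal distribution: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, 300 thumbs up out of 500 rated answers for chunking 512 against 350 out of 500 for chunking 768 gives a z-score above 3, a difference unlikely to be noise. With a few dozen votes per arm, the same ten-point gap would not be conclusive.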

The pitfalls that wreck a RAG

Four mistakes recur systematically in our audits. First, chunking that's too fine or too coarse, defined once and never recalibrated. Second, embeddings chosen by default without testing on French, when language makes a major difference on certain corpora. Third, absence of reranking — the most profitable step in the pipeline, yet regularly omitted. Fourth, absence of an update pipeline: the knowledge base is frozen at deployment date, and the chatbot answers questions about products that no longer exist.
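The fourth mistake is the easiest to prevent with a small amount of plumbing. A minimal sketch of an update pipeline, assuming the index stores a content fingerprint per document: at each sync, compare the fingerprints of the current sources with those already indexed, and derive what to add, re-embed or delete. The function names and the returned structure are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash used to detect that a document has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(indexed: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare indexed fingerprints with the current sources.

    `indexed` maps doc_id -> fingerprint stored at last indexing.
    `current_docs` maps doc_id -> current text of the source.
    """
    current = {doc_id: fingerprint(text) for doc_id, text in current_docs.items()}
    return {
        # New documents: chunk, embed and insert.
        "add": sorted(d for d in current if d not in indexed),
        # Changed documents: delete old chunks, then re-chunk and re-embed.
        "update": sorted(d for d in current if d in indexed and indexed[d] != current[d]),
        # Documents removed at the source: delete their chunks from the index.
        "delete": sorted(d for d in indexed if d not in current),
    }
```

The "delete" list is the part most often forgotten: without it, chunks about discontinued products stay retrievable indefinitely.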

On top of these technical pitfalls comes a methodological one: judging a RAG "by eye". Without a golden dataset, without RAGAS, without A/B testing, any quality discussion is subjective and therefore unmanageable. We enforce measurement from the first iteration, because you only improve what you measure.

What's next?

A high-performing RAG isn't a few-day project. It's a system that's designed, measured and maintained. But that investment is precisely what makes the difference between a chatbot your teams actually use and a gadget abandoned after three weeks. At DevHighWay, we treat RAG as the backbone of every chatbot we ship: audited sources, calibrated chunking, tested embeddings, reranker in place, continuous evaluation.

  • Get in touch for a RAG scoping engagement (1-day workshop, quantified technical battle plan)
  • Check our pricing — our packages include RAG maintenance and continuous evaluation
  • Our audit can also cover your assistant's visibility on the SEO and GEO side

If you take one idea from this article: a RAG chatbot's quality isn't decided by the choice of LLM, it's decided by retrieval engineering. The six steps described here aren't optional, they form a system. Skipping one means accepting systemic degradation of final quality. Treating them seriously, with method and measurement, gives you the means to ship an assistant your teams will use every day — and which, over time, will become a strategic asset of your organization.