Every introduction to retrieval-augmented generation shows the same diagram. User asks a question. System searches a knowledge base. Retrieved documents get passed to the model. Model generates an answer. The diagram makes it look like a four-step process. Building it in production involves far more than four decisions.
I have been spending time with teams building RAG systems, and the patterns that separate the ones that work from the ones that struggle are usually not in the retrieval algorithm or the choice of embedding model. They lie in the decisions that come before and after those steps.
Before the retrieval step, how you chunk documents matters enormously. The common assumption is that splitting by paragraph or by a fixed number of tokens is sufficient. In practice, chunking that ignores semantic boundaries produces retrieval results that lose context. A sentence that makes no sense without the three paragraphs before it will surface as a retrieved result and confuse the model. Chunking strategies that preserve semantic coherence, include useful metadata, and handle tables and code differently from prose all require design decisions specific to the type of content you are indexing.
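As a minimal sketch of what "respecting semantic boundaries" can mean in practice: split on paragraph breaks rather than fixed token counts, merge paragraphs up to a size budget, and attach metadata to each chunk. The function name, the character budget, and the metadata fields here are illustrative assumptions, not any particular library's API.

```python
import re

def chunk_document(text: str, source: str, max_chars: int = 800) -> list[dict]:
    """Chunk plain text on paragraph boundaries, attaching simple metadata.

    Paragraphs are never split mid-way; a new chunk starts only when adding
    the next paragraph would exceed the character budget.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[dict] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        # Flush the current chunk when the budget would be exceeded, so the
        # split point is always a paragraph boundary, not a token count.
        if current and size + len(para) > max_chars:
            chunks.append({"text": "\n\n".join(current),
                           "source": source,
                           "position": len(chunks)})
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append({"text": "\n\n".join(current),
                       "source": source,
                       "position": len(chunks)})
    return chunks
```

A real system would go further (overlap between chunks, separate handling for tables and code, richer metadata), but even this simple structure avoids the mid-sentence cuts that fixed-size splitting produces.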
After retrieval, what you do with the results before passing them to the model changes quality significantly. A naive implementation passes the top results in order of similarity score. Better implementations rerank, filter by relevance threshold, deduplicate overlapping chunks, and structure the context in ways that make it easier for the model to use. None of this is complicated, but all of it requires intentional design rather than accepting the default pipeline.
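The post-retrieval steps above can be sketched in a few lines. This is a hedged illustration, not a specific library's pipeline: it assumes results carry a `text` and a higher-is-better `score`, filters by a relevance threshold, and uses word-level Jaccard overlap as a cheap stand-in for embedding-based deduplication.

```python
def postprocess(results: list[dict], min_score: float = 0.3,
                dedup_overlap: float = 0.8, top_k: int = 5) -> list[dict]:
    """Filter, deduplicate, and truncate retrieved chunks before prompting."""
    kept: list[dict] = []
    for r in sorted(results, key=lambda r: r["score"], reverse=True):
        if r["score"] < min_score:
            continue  # drop chunks below the relevance threshold
        words = set(r["text"].lower().split())
        # Skip chunks that heavily overlap an already-kept chunk -- a crude
        # but serviceable proxy for "deduplicate overlapping chunks".
        duplicate = any(
            len(words & set(k["text"].lower().split()))
            / max(1, len(words | set(k["text"].lower().split()))) > dedup_overlap
            for k in kept
        )
        if not duplicate:
            kept.append(r)
        if len(kept) == top_k:
            break
    return kept
```

Reranking with a cross-encoder would slot in before the loop; the point is that each of these steps is a deliberate choice rather than a default.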
The evaluation problem is particularly stubborn. Knowing whether your RAG system is working well requires ground truth about what the right answers are. For many organisations, that ground truth does not exist in a usable form. Building evaluation datasets is expensive and time-consuming. Running human evaluation at scale is impractical. The result is that many RAG systems are deployed with limited confidence in how they will behave across the full range of queries they will actually receive.
The teams that have built RAG systems that genuinely work in production have usually invested significantly in evaluation infrastructure first. They built the machinery to understand quality before they built the pipeline to improve it. That inversion of the typical build sequence is uncomfortable but appears to be the pattern that leads to reliable outcomes.
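The evaluation machinery does not have to start large. A minimal sketch, assuming a small hand-labelled set mapping queries to the chunk ids that should be retrieved, might measure nothing more than hit rate at k; `retrieve` here is a stand-in for whatever retriever is under test, and all names are hypothetical.

```python
def hit_rate_at_k(eval_set: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of queries where at least one relevant chunk id appears
    in the top-k retrieved results."""
    hits = 0
    for example in eval_set:
        retrieved_ids = {r["id"] for r in retrieve(example["query"])[:k]}
        if retrieved_ids & set(example["relevant_ids"]):
            hits += 1
    return hits / len(eval_set)
```

Even a few dozen labelled queries run through a metric like this gives more signal than deploying blind, and the harness grows naturally into answer-quality evaluation later.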
My broader observation is that RAG is often described as a solution to hallucination. It is more accurate to say it is a reduction in hallucination risk for certain types of queries about certain types of knowledge, when implemented carefully. That more qualified framing is less appealing as a headline but more useful as a design assumption.