I built a retrieval-augmented generation pipeline over our internal docs, demoed it, and watched it answer a question with total confidence and total inaccuracy. It cited a real document. It just cited the wrong part of it, and then the model cheerfully filled the gaps with plausible invention. The demo audience nodded along, which was the worst part, because the answer was wrong and it sounded right.
RAG is sold as the easy win: embed your documents, stick them in a vector store, retrieve the top matches, stuff them in the prompt, done. And the happy path really is that easy to stand up. The problem is that "stands up" and "is useful" are separated by a surprising amount of fiddly, unglamorous work, none of which is in the tutorials.
Here's what I got wrong and what fixed it.
chunking was the whole game
My first pass split documents into fixed 1000-character chunks with no overlap. This is the default everywhere and it's quietly terrible. It cuts sentences in half. It separates a heading from the paragraph that explains it. It puts the question ("what timeout should I use?") in one chunk and the answer ("the default is 30 seconds") in the next, so retrieving one never brings the other.
# the naive version
chunks = [text[i:i+1000] for i in range(0, len(text), 1000)]
The fix was to chunk on structure rather than character count. Split on headings and paragraph boundaries, keep chunks within a sensible token budget, and add a generous overlap so context that straddles a boundary survives. Even better: prepend the document title and section heading to each chunk before embedding, so a chunk that says "the default is 30 seconds" embeds as "Configuration > Timeouts: the default is 30 seconds" and actually matches a question about timeouts.
That single change did more for answer quality than anything else I tried.
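A minimal sketch of the structure-aware chunking described above. It rests on several assumptions that are not from the original pipeline: headings are markdown-style `#` lines, paragraphs are separated by blank lines, a character budget stands in for a token budget, and overlap is measured in whole paragraphs. The name `chunk_by_structure` and its parameters are hypothetical.

```python
import re


def chunk_by_structure(doc_title, text, max_chars=1200, overlap_paragraphs=1):
    """Split text on headings and paragraphs, prefixing each chunk with its context."""
    chunks = []
    heading = ""
    paragraphs = []  # paragraphs collected under the current heading

    def flush():
        # Pack the current section's paragraphs into chunks within the budget.
        # A single paragraph longer than max_chars becomes its own oversized chunk.
        prefix = " > ".join(part for part in (doc_title, heading) if part)
        start = 0
        while start < len(paragraphs):
            end = start + 1
            size = len(paragraphs[start])
            while end < len(paragraphs) and size + len(paragraphs[end]) <= max_chars:
                size += len(paragraphs[end])
                end += 1
            body = "\n\n".join(paragraphs[start:end])
            chunks.append(f"{prefix}: {body}" if prefix else body)
            if end >= len(paragraphs):
                break
            # Step back so the next chunk repeats the tail of this one.
            start = max(end - overlap_paragraphs, start + 1)
        paragraphs.clear()

    for block in re.split(r"\n\s*\n", text.strip()):
        first_line, _, rest = block.strip().partition("\n")
        match = re.match(r"#+\s+(.*)", first_line)
        if match:
            flush()  # a new heading closes the previous section
            heading = match.group(1).strip()
            block = rest
        if block.strip():
            paragraphs.append(block.strip())
    flush()
    return chunks


# Assumed example input: a tiny doc with one section.
for chunk in chunk_by_structure("Configuration", "## Timeouts\n\nThe default is 30 seconds."):
    print(chunk)
```

Chunks never cross a heading in this sketch, so overlap only applies within a section. That is a design choice, not a requirement; the heading prefix is what carries context across sections.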
embeddings don't understand your jargon
The second problem was that semantic similarity is only as good as the embedding model's idea of what's similar. Our docs are full of internal product names and acronyms that mean nothing to a general-purpose embedding model. A user asking about "Halyard" (an internal service) got back chunks about unrelated things, because to the embedder "Halyard" is just a nautical word and the actual relevant doc never used it by name in the chunk that mattered.
Pure vector search is bad at exact-match terms: identifiers, error codes, version numbers, the precise things engineers actually search for. The answer is not to throw out vector search; it's to stop relying on it alone. I added a keyword search (a plain BM25 index) alongside the vector search and combined the results. Hybrid retrieval. The vector side catches "how do I make this faster" matching a doc titled "performance tuning"; the keyword side catches someone pasting an exact error string. You want both.
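How the two result lists get combined is left open above. One common option is reciprocal rank fusion, sketched below as an assumption rather than the method used in this pipeline. It needs only the rank of each document in each list, so the BM25 scores and vector similarities never have to be put on the same scale. The function name, the document IDs, and the constant `k=60` (a conventional default) are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so a document ranked well by either retriever rises to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two retrievers for the same query.
vector_hits = ["perf-tuning", "halyard-runbook", "caching"]
keyword_hits = ["halyard-runbook", "error-codes"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

A document that shows up in both lists, like the hypothetical `halyard-runbook` here, beats one that only a single retriever found.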
retrieving the right thing isn't the same as using it
Even with good chunks and hybrid search, the model would sometimes ignore the retrieved context and answer from its own training, or blend the two into something that was neither. Two things helped.
First, the prompt. Being explicit and slightly stern works: tell the model to answer only from the provided context, to say "I don't know" when the context doesn't contain the answer, and to quote the source. Models are pleasers by default and will invent rather than admit ignorance unless you give them permission to admit it.
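A sketch of a prompt builder that follows those three rules. The wording, the `build_prompt` name, and the chunk fields `source` and `text` are assumptions for illustration, not the prompt from the original pipeline.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks.

    Each chunk is a dict with hypothetical keys 'source' and 'text'.
    """
    # Number each chunk so the model has something concrete to cite.
    context = "\n\n".join(
        f"[{i}] {chunk['source']}\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n"
        "If the context does not contain the answer, reply exactly: I don't know.\n"
        "Quote the passage you relied on and cite its number in brackets.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What timeout should I use?",
    [{"source": "config.md", "text": "Configuration > Timeouts: the default is 30 seconds"}],
)
print(prompt)
```

Asking for a fixed refusal string also makes refusals easy to count later, when you are measuring how often the retrieved context falls short.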
Second, a re-ranking step. First-pass retrieval gives you twenty candidates that are roughly relevant. A cross-encoder re-ranker reads the question and each candidate together and scores each pair directly, which is slower but far more accurate than the initial similarity score. I retrieve twenty, re-rank, and pass the top five to the model. The first-pass retrieval just has to get the right chunk into the top twenty; the re-ranker does the precision work.
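The retrieve-then-re-rank shape can be sketched as below. The scorer is passed in as a plain function, so the example runs without a model. In a real pipeline `score` would wrap a cross-encoder; the `overlap_score` used here is a toy word-overlap stand-in and is not a cross-encoder. All names are hypothetical.

```python
def rerank(question, candidates, score, keep=5):
    """Score every (question, candidate) pair and keep the best few.

    `score` is any callable returning a relevance number for one pair.
    """
    scored = [(score(question, candidate), candidate) for candidate in candidates]
    # Highest score first; the sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:keep]]


def overlap_score(question, chunk):
    """Toy stand-in scorer: fraction of question words present in the chunk."""
    question_words = set(question.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(question_words & chunk_words) / (len(question_words) or 1)


# Hypothetical first-pass candidates, in retrieval order.
candidates = [
    "performance tuning guide",
    "timeout errors and retries",
    "the default timeout is 30 seconds",
]
top = rerank("what is the default timeout", candidates, overlap_score, keep=2)
print(top)
```

Keeping the scorer pluggable also makes it easy to measure the pipeline with and without re-ranking by swapping in a scorer that returns a constant.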
the unglamorous truth
None of this is clever. There's no novel architecture here, no fine-tuning, nothing you'd write a paper about. It's chunking with care, hybrid retrieval, a re-ranker, and a prompt that gives the model an honourable way out. The interesting research is all upstream; the thing that makes your pipeline actually useful is downstream plumbing and a willingness to look at what it retrieved and ask why.
My first attempt was useless because I treated RAG as a solved, four-step recipe. It isn't. The retrieval is the product. The model on top is almost incidental: feed it the right three paragraphs and even a modest model answers well; feed it the wrong ones and the best model in the world will lie to you beautifully.
If you're starting out, do this: build an evaluation set first. Twenty real questions with known answers. Then every change you make, whether to chunking, hybrid retrieval, or re-ranking, is something you can measure instead of guess at. I built mine after the embarrassing demo, which is exactly the wrong order, so learn from that and build yours before your demo.
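A minimal sketch of such an evaluation loop, assuming each question is paired with a short string that any correct retrieved chunk must contain. It measures retrieval only (did the answer make it into the top k?), which is a simplification of full answer grading. The names `hit_rate_at_k`, `question`, and `answer_contains`, and the fake retriever, are hypothetical.

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose known answer appears in the top-k chunks.

    `retrieve` is any callable mapping a question to a ranked list of chunk texts.
    """
    if not eval_set:
        return 0.0
    hits = 0
    for item in eval_set:
        top_chunks = retrieve(item["question"])[:k]
        expected = item["answer_contains"].lower()
        if any(expected in chunk.lower() for chunk in top_chunks):
            hits += 1
    return hits / len(eval_set)


# Hypothetical eval set and a fake retriever standing in for the real pipeline.
eval_set = [
    {"question": "What timeout should I use?", "answer_contains": "30 seconds"},
    {"question": "What is Halyard?", "answer_contains": "internal service"},
]
fake_index = {
    "What timeout should I use?": ["Configuration > Timeouts: the default is 30 seconds"],
    "What is Halyard?": ["A halyard is a rope used to hoist a sail"],
}
rate = hit_rate_at_k(eval_set, lambda q: fake_index.get(q, []))
print(f"hit rate: {rate:.2f}")
```

Run it before and after each change; a number that moves is easier to argue with than a demo that felt better.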