I built a RAG pipeline over my own notes, asked it a question I knew the answer to, and got a fluent, confident, completely useless reply. It cited the right document and drew the wrong conclusion from it. That is worse than a blank stare, because it looks like it worked.
The premise of retrieval-augmented generation is sound and I still believe in it. You embed your documents, retrieve the most relevant chunks at query time, put them in the prompt, and let the model answer grounded in your data rather than in its training data alone. The failure was entirely in the parts nobody puts in the demo.
Chunking is the whole game
My first version split documents on a fixed character count. This is the default in nearly every tutorial and it is quietly terrible. A 500-character window cuts sentences in half, separates a heading from the paragraph that explains it, and orphans a code snippet from the line that says what it does. The embeddings you get from those fragments are mush, and mush retrieves mush.
Splitting on structure instead, by heading and paragraph, with a sentence-aware overlap, did more for answer quality than any model change I tried afterwards. The retriever can only hand the model what the chunks contain. If your chunks are nonsense, the cleverest model in the world will reason beautifully over nonsense.
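A minimal sketch of what structure-aware splitting can look like, assuming the notes are Markdown-style text where headings start with '#' and paragraphs are separated by blank lines. The function name chunk_by_structure and the overlap_sentences parameter are made up for illustration, and the sentence splitter is deliberately naive.

```python
import re

def chunk_by_structure(text, overlap_sentences=1):
    """Split text into paragraph chunks that keep their heading, with sentence overlap.

    Assumes Markdown-style notes: headings start with '#', paragraphs are
    separated by blank lines. Both are assumptions made for this sketch.
    """
    chunks = []
    heading = ""
    prev_tail = ""  # trailing sentence(s) of the previous paragraph in this section
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if lines and lines[0].lstrip().startswith("#"):
            # New section: remember the heading, do not carry overlap across it.
            heading = lines[0].strip().lstrip("#").strip()
            prev_tail = ""
            lines = lines[1:]
        body = " ".join(" ".join(lines).split())
        if not body:
            continue
        # Each chunk carries its heading and the tail of the previous paragraph.
        parts = [p for p in (heading, prev_tail, body) if p]
        chunks.append({"heading": heading, "text": "\n".join(parts)})
        # Naive sentence split: '.', '!' or '?' followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", body)
        prev_tail = " ".join(sentences[-overlap_sentences:]) if overlap_sentences > 0 else ""
    return chunks
```

Every chunk here carries its heading, and each paragraph after the first in a section starts with the last sentence of the one before it. The sketch does not handle code snippets. Keeping a snippet with the line that introduces it would need one more rule that treats the two as a single block.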
I trusted cosine similarity too much
The second mistake was treating top-k nearest-neighbour search as if it returned relevant results. It returns similar embeddings, which is not the same thing, and for short queries against long documents the gap is wide. The fix that helped most was unglamorous: retrieve more candidates than I needed, then re-rank them with a cross-encoder that actually looks at the query and the passage together. Slower, much better. I also added a plain keyword search alongside the vector search, because for exact terms such as error codes, function names, or a config key, lexical matching still beats embeddings handily.
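The shape of that fix, over-retrieve from both searches and then re-rank the merged pool, can be sketched in plain Python. Everything named here is hypothetical: vector_search stands in for whatever index is in use and is assumed to return chunk indices, score_pair stands in for the cross-encoder, and the keyword search is a toy term-overlap count rather than a real lexical ranker such as BM25.

```python
import re

def tokens(text):
    """Lowercased word tokens; keeps identifiers like retry_backoff or E1042 whole."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def keyword_search(query, chunks, limit):
    """Rank chunk indices by how many distinct query terms appear verbatim."""
    terms = set(tokens(query))
    scored = []
    for idx, chunk in enumerate(chunks):
        hits = len(terms & set(tokens(chunk)))
        if hits:
            scored.append((hits, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:limit]]

def retrieve_and_rerank(query, chunks, vector_search, score_pair, top_k=3, candidates=20):
    """Over-retrieve from vector and keyword search, then re-rank the merged pool.

    vector_search(query, n) -> list of chunk indices (hypothetical interface).
    score_pair(query, passage) -> float, higher is better (cross-encoder stand-in).
    """
    merged = vector_search(query, candidates) + keyword_search(query, chunks, candidates)
    pool = list(dict.fromkeys(merged))  # drop duplicates, keep first-seen order
    # Score each (query, passage) pair jointly instead of comparing embeddings.
    ranked = sorted(pool, key=lambda idx: score_pair(query, chunks[idx]), reverse=True)
    return ranked[:top_k]
```

With the sentence-transformers library, a CrossEncoder model's predict method over (query, passage) pairs could fill the score_pair role. That is one option under the assumption that the library is available, not a claim about what was used here.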
No evaluation meant no idea
The real root cause was that I had no way to tell whether a change helped. I was eyeballing single answers and trusting my gut, which is how you fool yourself for a week. I wrote a tiny evaluation set, twenty questions each paired with the documents that should be retrieved for it, and measured whether the right chunks came back at all before worrying about the generation step. Retrieval quality is measurable without a model in the loop, and once I measured it the bottleneck was obvious.
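A retrieval-only check like this needs very little code. The sketch below computes recall at k over a small evaluation set. The data layout (a list of dicts with "question" and "expected" keys) and the retrieve callable are assumptions for illustration, not a fixed format.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Mean fraction of expected chunk ids that appear in the top-k results.

    eval_set: list of {"question": str, "expected": list of chunk ids}.
    retrieve(question, k) -> ranked list of chunk ids (hypothetical interface).
    Returns (mean_recall, misses), where misses lists what was not retrieved.
    """
    total = 0.0
    misses = []
    for case in eval_set:
        expected = set(case["expected"])
        if not expected:
            raise ValueError("every eval case needs at least one expected chunk")
        got = set(retrieve(case["question"], k)[:k])
        total += len(expected & got) / len(expected)
        missing = sorted(expected - got)
        if missing:
            misses.append((case["question"], missing))
    return total / len(eval_set), misses
```

Running this before and after each chunking or retrieval change gives a number to compare instead of a gut feeling, and the misses list points at the questions worth reading by hand.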
None of this is novel. That is rather the point: the hard parts of RAG are data plumbing and evaluation, the same unglamorous work as any other data system. The model is the easy bit. I'd assumed the opposite, which is why my first attempt was useless.