my first rag pipeline was confidently wrong

I wanted to ask questions of my own notes. A few years of markdown, meeting scribbles, half-finished design docs, the usual sediment. RAG is the obvious shape: embed the documents, store the vectors, retrieve the relevant chunks for a question, stuff them into the prompt, let the model answer from them. I built it in an evening. It was useless, and the interesting part is exactly how it was useless.

It wasn't useless in the way I expected. It didn't error, it didn't hallucinate wildly, it didn't refuse. It answered every question fluently and confidently, and the answers were subtly, persistently wrong. It would tell me a decision went one way when my notes said the other, or cite a project that was adjacent to the one I asked about, or blend two meetings into a single plausible fiction. That's the dangerous failure mode, because fluent and wrong reads exactly like fluent and right until you check.

the model was never the problem

Every single fault traced back to retrieval, not generation. The model could only answer from what I handed it, and I was handing it the wrong chunks. Garbage in, confident garbage out.

chunking by character count is a trap

My first version split documents into 1000-character chunks with a bit of overlap, because that's what most tutorials do. A fixed character count cuts straight through sentences, tables, and the one paragraph that actually contains the answer, leaving half of it in one chunk and half in the next, so neither half retrieves well. I switched to splitting on structure: headings first, then paragraphs, with a maximum size as a backstop rather than the primary rule. Chunks that respect the document's own boundaries retrieve far better, because each one is about one thing.
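A minimal sketch of structure-first chunking, assuming the notes are markdown with `#`-style headings. The function name `chunk_by_structure` and the `max_chars` default are illustrative choices, not the exact code from this pipeline.

```python
import re


def chunk_by_structure(text: str, max_chars: int = 1000) -> list[str]:
    """Split on headings, then paragraphs; max_chars is only a backstop."""
    # Zero-width split before each markdown heading line, so every
    # section keeps its own heading as its first line.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)  # whole section fits: keep it intact
            continue
        # Oversized section: pack whole paragraphs up to the limit.
        current = ""
        for para in re.split(r"\n\s*\n", section):
            para = para.strip()
            if not para:
                continue
            candidate = f"{current}\n\n{para}" if current else para
            if len(candidate) <= max_chars:
                current = candidate
                continue
            if current:
                chunks.append(current)
            # Backstop: a single paragraph over the limit is cut by size.
            while len(para) > max_chars:
                chunks.append(para[:max_chars])
                para = para[max_chars:]
            current = para
        if current:
            chunks.append(current)
    return chunks


doc = "# Cache\n\nWe decided to use Redis.\n\n# Auth\n\nTokens expire hourly."
chunks = chunk_by_structure(doc, max_chars=200)
```

Here each heading and its body land in one chunk, so the decision about the cache is never separated from the heading that says what it is about. The character limit only kicks in when a section or a single paragraph is too long to keep whole.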

similarity is not relevance

The second trap is trusting the vector search. Cosine similarity finds chunks that are about the same topic as the question, which is not the same as chunks that answer it. Ask "what did we decide about the cache?" and you'll happily retrieve five chunks that mention the cache and none that record the decision, because "we decided" doesn't look much like the question in embedding space. The fix was to stop relying on dense vectors alone and add a keyword search alongside, then merge the two rankings. Hybrid retrieval, dense plus sparse, caught the cases where the exact word mattered and the semantics didn't.
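One common way to merge a dense ranking with a keyword ranking is reciprocal rank fusion, sketched below. This is an assumption about the merge step, since the text above only says the two rankings were merged. The chunk ids are made up, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes more for higher ranks; k damps the
            # influence of any single list's top result.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results for "what did we decide about the cache?"
dense = ["cache-overview", "cache-decision", "cache-sizing"]    # vector search
sparse = ["cache-decision", "standup-notes", "cache-overview"]  # keyword search
merged = reciprocal_rank_fusion([dense, sparse])
```

With these made-up rankings, `cache-decision` comes out first in `merged` because both searches rate it highly, even though neither search alone is trusted. Rank fusion also avoids having to compare cosine scores with keyword scores, which live on different scales.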

no answer is a valid answer

The worst behaviour was the model inventing an answer when the retrieved chunks didn't contain one. If retrieval comes back with nothing relevant, the right output is "I don't know," not a fluent guess. I changed the prompt to say so explicitly: answer only from the provided context, and say so when the context is insufficient. I also added a similarity floor, so that genuinely poor retrievals returned nothing rather than the five least-bad chunks. Together, those changes did more for trust than anything else, because a system that admits ignorance is one you can actually rely on.
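A rough sketch of the floor and the prompt guard, assuming retrieval returns `(chunk_text, similarity)` pairs. The `0.75` floor, the function names, and the prompt wording are all hypothetical; a sensible floor depends on the embedding model and needs tuning against real questions.

```python
from typing import Optional

NO_ANSWER = "I don't know."


def select_context(scored_chunks: list[tuple[str, float]],
                   floor: float = 0.75, top_k: int = 5) -> list[str]:
    """Keep at most top_k chunks, and only those at or above the floor."""
    kept = [(text, score) for text, score in scored_chunks if score >= floor]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept[:top_k]]


def build_prompt(question: str, chunks: list[str]) -> Optional[str]:
    """Return a grounded prompt, or None when there is nothing to ground on."""
    if not chunks:
        return None  # caller should answer NO_ANSWER without calling the model
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer using only the context below. If the context does not "
        f'contain the answer, reply exactly: "{NO_ANSWER}"\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical retrieval results: nothing clears the floor.
weak = [("notes on lunch options", 0.41), ("cache sizing table", 0.58)]
prompt = build_prompt("what did we decide about the cache?", select_context(weak))
```

In this example `prompt` is `None`, so the caller can return the "I don't know" answer directly instead of handing the model the least-bad chunks and hoping it declines.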

the rerank step that made it click

The thing that finally lifted it from "occasionally useful" to "I reach for this daily" was a reranker. Vector search is cheap and casts a wide net, so I retrieve the top twenty or thirty candidates, then run a cross-encoder over the question paired with each candidate to score how well that candidate actually answers it, and keep the best handful. The retrieval finds plausibly relevant chunks fast; the reranker does the careful, expensive judgement on a small set. It costs a little latency and it's worth every millisecond, because it's the step that distinguishes "mentions the topic" from "answers the question," which was my whole problem from the start.
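The retrieve-wide-then-rerank shape can be sketched with a pluggable scoring function. The `toy_score` below is a word-overlap stand-in so the example runs on its own; it is not a cross-encoder, and in a real pipeline a cross-encoder model's score for each (question, chunk) pair would take its place. All names here are illustrative.

```python
from typing import Callable


def rerank(question: str, candidates: list[str],
           score_pair: Callable[[str, str], float], keep: int = 5) -> list[str]:
    """Score every (question, candidate) pair and keep the best few."""
    scored = [(score_pair(question, chunk), chunk) for chunk in candidates]
    # Stable sort: candidates with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep]]


def toy_score(question: str, chunk: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): share of question words in chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


# Pretend these came back from the wide, cheap vector search.
candidates = [
    "the cache is 2 GB",
    "we did decide the cache moves to redis",
    "lunch order",
]
top = rerank("what did we decide about the cache", candidates, toy_score, keep=2)
```

The expensive scorer only ever sees the twenty or thirty candidates from the first stage, which is what keeps the added latency small.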

what I'd tell past me

Spend your effort on the boring half. The glamour is all in the model and the prompt, and that's the part that mostly just works. The grind is in chunking, retrieval, hybrid search, reranking, and being honest about when there's no answer. RAG is mostly a search problem wearing a language-model hat, and my first attempt failed because I treated it as a language-model problem with a search step bolted on. Get the search right and the model will reward you. Get the search wrong and it will lie to you beautifully, which is so much worse than failing loudly.