my first rag pipeline was confidently useless

I built a retrieval-augmented question-answering thing over our internal docs, demoed it to myself, and it was confidently, fluently wrong. Not hallucinating from nothing, which would at least be a known failure. It was citing real documents and drawing the wrong conclusions from them, which is worse, because it looks like it's working right up until someone trusts it.

The pitch for RAG is straightforward and genuinely good: instead of relying on what a model memorised, you fetch relevant chunks from your own corpus at query time and stuff them into the prompt. The model answers from material you control, with citations. The hard part, which nobody's slide deck mentions, is the word "relevant". My first attempt got that word wrong in three separate places.

chunking by character count is a trap

My first sin was the most common one. I split every document into 1000-character chunks with a 100-character overlap, embedded each chunk, done. It's the example everyone copies, and it shreds your documents at meaningless boundaries.

A runbook step would get cut in half. The heading "Rollback procedure" landed in one chunk and the actual procedure in the next, so a query about rollback retrieved the heading and three paragraphs of unrelated preamble. A table got chopped mid-row into nonsense. The embeddings were faithful representations of fragments that meant nothing on their own.

The fix was to chunk on structure, not character count. I split on Markdown headings first, kept each section whole where it fit, and only fell back to size-based splitting inside sections that were genuinely too long. I prepended the document title and the heading path to each chunk before embedding, so a chunk carries its own context:

[Deploys > Rollback procedure]
To roll back, run `deploy rollback --to <sha>` ...

That one change, embedding the chunk together with where it lives, did more for retrieval quality than anything else I tried.
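
A minimal sketch of that chunking strategy, to make the idea concrete. The names here (`Chunk`, `chunk_markdown`, `split_oversized`) and the 2000-character default are illustrative assumptions, not the code from the actual pipeline. It only understands `#`-style Markdown headings, and it will misread a `#` comment inside a code block as a heading.

```python
import re
from dataclasses import dataclass

# Matches ATX headings such as "## Rollback procedure".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    heading_path: list[str]  # document title followed by the nested headings
    body: str

    def text_for_embedding(self) -> str:
        # Prefix the body with where it lives, so the chunk carries its context.
        return f"[{' > '.join(self.heading_path)}]\n{self.body}"


def split_oversized(body: str, max_chars: int) -> list[str]:
    """Fallback: pack whole paragraphs into pieces of roughly max_chars.

    A single paragraph longer than max_chars is kept whole, not cut.
    """
    pieces, current = [], ""
    for para in body.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            pieces.append(current)
            current = para
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def chunk_markdown(doc_title: str, markdown: str, max_chars: int = 2000) -> list[Chunk]:
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (heading level, heading text)
    lines: list[str] = []

    def flush() -> None:
        # Emit whatever text has accumulated under the current heading path.
        body = "\n".join(lines).strip()
        lines.clear()
        if not body:
            return
        heading_path = [doc_title] + [text for _, text in path]
        for piece in split_oversized(body, max_chars):
            chunks.append(Chunk(heading_path, piece))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before pushing.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            lines.append(line)
    flush()
    return chunks
```

The string you embed is `chunk.text_for_embedding()`, not the bare body. The heading stays attached to the procedure it introduces, and size-based splitting only happens inside a section that is too long on its own.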

the embedding model has opinions

My second sin was assuming all embedding models are roughly interchangeable. I'd grabbed a small, fast one because it was free and local, and its sense of "similar" was coarser than I needed. Queries phrased as questions retrieved chunks that shared surface words rather than meaning. "How do I revoke a token?" pulled back a chunk about issuing tokens, because both are dense with the word "token".

Two things helped. First, an embedding model actually trained for retrieval, where the query and the passage are encoded with that asymmetry in mind, rather than a general-purpose similarity model. The difference in retrieval hit rate was not subtle. Second, and this is the bit I should have done from the start, I stopped trusting the top-k vector results as final.

retrieve wide, then rerank

Vector search is a fast, approximate first pass. It's good at getting the right chunk somewhere into the top 50. It's mediocre at getting it into the top 5, which is about all you can afford to put in the prompt. So the pattern that finally worked was two stages: retrieve a wide net of candidates by vector similarity, then run a cross-encoder reranker over the query-and-candidate pairs to actually score relevance, and keep only the best handful.

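# Stage 1: cast a wide net with cheap vector similarity.
candidates = vector_store.search(query, k=50)
# Stage 2: score each (query, passage) pair with the cross-encoder.
scores = reranker.score(query, [c.text for c in candidates])
# Keep the five highest-scoring candidates.
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
top = [c for c, _ in ranked[:5]]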

A reranker is slower per pair because it reads the query and the passage together rather than comparing two precomputed vectors, but it only runs on 50 candidates, not the whole corpus, so the cost is bounded and small. The quality jump was the difference between a toy and something I'd let a colleague use.

tell the model it's allowed to not know

The last fix was in the prompt, not the pipeline. The model, given chunks, would always produce a confident answer even when the chunks didn't contain one. So I told it plainly: answer only from the provided context, cite the source for each claim, and if the context doesn't contain the answer, say so and stop. "I don't have that in the docs" is a correct and useful answer. A fluent paragraph assembled from adjacent-but-wrong material is a trap with a bow on it.
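
Those instructions could be assembled into a prompt roughly like this. It's a sketch under assumptions: the `Retrieved` type, the `build_prompt` name and the exact instruction wording are made up for illustration, and only the refusal sentence comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Retrieved:
    source: str  # hypothetical label, e.g. a file path or page title
    text: str


REFUSAL = "I don't have that in the docs."


def build_prompt(question: str, chunks: list[Retrieved]) -> str:
    # Number each chunk so the model has something concrete to cite.
    context = "\n\n".join(
        f"[{i}] source: {c.source}\n{c.text}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n"
        "Cite the source number, like [1], after each claim.\n"
        f'If the context does not contain the answer, reply exactly "{REFUSAL}" and stop.\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )
```

Pinning the refusal to one exact sentence has a side benefit: you can count how often the system declines to answer with a plain string comparison.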

evaluate retrieval on its own, before the model gets involved

The mistake that cost me the most time was conflating two failures. When the system gave a wrong answer, I couldn't tell whether the retrieval had handed the model the wrong chunks, or whether the retrieval was fine and the model had fumbled good material. Those need completely different fixes, and I was trying to tune both at once by squinting at final answers, which is hopeless.

The cure was to build a small evaluation set and test retrieval in isolation. I wrote down maybe forty real questions people actually ask, and for each one noted which document, ideally which chunk, contains the answer. Then I ran retrieval alone and measured: for each query, did the correct chunk appear in the top results at all, and if so, how high. No language model in the loop, just the question, the index, and a number.

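def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of queries whose gold document appears in the top k results."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for q, gold_doc in eval_set:
        results = retrieve(q, k=k)
        # A hit means the right document came back at all, at any rank.
        if any(r.doc_id == gold_doc for r in results):
            hits += 1
    return hits / len(eval_set)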

Suddenly I had a dial I could read. Chunking on structure moved recall@5 up sharply. Swapping the embedding model moved it again. Adding the reranker moved it again. Each change was a number going up or down, not a vibe about whether the answers "felt better". That's the difference between engineering and wishful thinking, and I'd skipped straight past it in my hurry to demo something.
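
Recall@k answers "did the right chunk come back at all". The "how high" half of the question needs a rank-aware number, and mean reciprocal rank is one common choice. A sketch, assuming the same `eval_set` and `retrieve` shapes as `recall_at_k`, with `mean_reciprocal_rank` being a name chosen here for illustration:

```python
def mean_reciprocal_rank(eval_set, retrieve, k=5):
    """Average of 1/rank of the gold document; a miss within k counts as 0."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    total = 0.0
    for q, gold_doc in eval_set:
        results = retrieve(q, k=k)
        for rank, r in enumerate(results, start=1):
            if r.doc_id == gold_doc:
                total += 1.0 / rank
                break  # only the first (highest) match counts
    return total / len(eval_set)
```

A gold document at rank 1 scores 1.0, at rank 2 scores 0.5, and a miss scores 0. Two setups with identical recall@5 can therefore still be told apart by how close to the top the right result lands.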

Only once retrieval was reliably putting the right chunk in front of the model did the remaining errors become genuinely the model's fault, and those were the easy ones to fix with prompt changes. If I'd had this harness from the start I'd have saved myself the entire confused middle of the project, where I kept changing the prompt to fix what was actually a chunking problem.
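
The same gold labels can sort wrong answers into the two buckets, so you know which half to fix. This is a hypothetical helper, not part of the harness described above. It assumes a human has already judged each answer right or wrong, and that results expose a `doc_id` attribute as in `recall_at_k`.

```python
from collections import Counter


def triage(gold_doc, results, answer_is_correct):
    """Label one evaluated question as 'ok', 'retrieval' or 'generation'."""
    if answer_is_correct:
        return "ok"
    gold_was_retrieved = any(r.doc_id == gold_doc for r in results)
    # The right text was in the prompt and the answer is still wrong:
    # that's the model's fault. Otherwise retrieval never gave it a chance.
    return "generation" if gold_was_retrieved else "retrieval"


def tally(cases):
    """cases: iterable of (gold_doc, results, answer_is_correct) tuples."""
    return Counter(triage(*case) for case in cases)
```

If the tally is dominated by "retrieval", prompt changes are unlikely to help, which is exactly the trap described above.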

where it landed

None of this is exotic. There's no clever model in here, just a string of unglamorous decisions about how text gets cut up, encoded, ranked and framed. That's the honest shape of RAG: the retrieval is the whole game, and the generation is the easy bit that gets all the attention.

If I were starting again I'd build the retrieval half first and evaluate it on its own, with a set of real questions and the chunks I know should come back, before letting a language model anywhere near it. Get the right text in front of the model and a decent model will do something sensible with it. Get the wrong text in front of it and you've built a machine for being wrong in complete sentences.