
Designing a Production-Ready RAG System: What Actually Matters

There is a gap between "a RAG demo that works on the PDFs you cherry-picked" and "a RAG system your support team uses every day." The demo is maybe 200 lines of LangChain. The production version is a multi-team engineering project. What fills the gap is not a better LLM — it is decades of classical IR work plus an evaluation discipline most AI projects skip.

This post is the concrete version of that gap. No framework advocacy, no "the future is agentic" — just the pieces you end up building when the stakes go up.

The Five-Stage Pipeline

Most production RAG systems collapse to roughly this shape:

  1. Ingestion — take raw documents, chunk them, produce embeddings, index.
  2. Retrieval — given a query, fetch candidate chunks. Usually hybrid: dense + sparse.
  3. Reranking — re-score the top N candidates with a stronger model to produce top K.
  4. Generation — compose a prompt with the top K chunks and ask the LLM to answer.
  5. Evaluation — measure whether the answer was actually any good.

The demo skips stages 3 and 5. Skipping them is why the demo looks fine on a handful of cherry-picked questions and why the production system falls apart.

Chunking Is Where Most Problems Start

The naive approach is a fixed token window — split every 500 tokens, maybe overlap 50. This works for uniformly-structured documents and fails on everything else.

Two observations that change your chunking strategy:

Structure beats uniformity. Markdown docs should chunk at heading boundaries; code files at function boundaries; legal documents at clause boundaries. Use the document's own structure to decide where a chunk ends. A MarkdownHeaderTextSplitter or the equivalent for your format produces dramatically better retrieval than a blind character count.
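A minimal heading-boundary splitter is only a few lines. This is a sketch, not a replacement for a proper structure-aware splitter; the heading regex and the size cap are assumptions you would tune per corpus.

import re

def split_markdown_by_headings(text: str, max_chars: int = 2000) -> list[str]:
    # Split before every Markdown heading line, keeping each heading with its section.
    sections = re.split(r"\n(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section.strip())
            continue
        # Oversized section: fall back to blank-line splits, not a blind character count.
        buffer = ""
        for para in section.split("\n\n"):
            if buffer and len(buffer) + len(para) > max_chars:
                chunks.append(buffer.strip())
                buffer = ""
            buffer += para + "\n\n"
        if buffer.strip():
            chunks.append(buffer.strip())
    return [c for c in chunks if c]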

Context should travel with the chunk. A chunk that says "Step 3: connect the database" is useless without knowing what the steps are for. I attach a compact context header to every chunk: document title, the heading path, the previous heading's first sentence. Retrieval surfaces the raw chunk; the LLM sees the chunk plus its breadcrumbs.

def enrich_chunk(chunk_text: str, headings: list[str], doc_title: str) -> str:
    """Prepend a breadcrumb header (document title + heading path) to a chunk."""
    header = f"Document: {doc_title}\nSection: {' > '.join(headings)}\n\n"
    return header + chunk_text

This one change — attaching breadcrumbs — reduced our "answer is technically from the right doc but addresses the wrong product" errors by roughly 40%.

Hybrid Retrieval: Dense Alone Is Not Enough

Dense vector retrieval is great at semantic similarity ("how do I reset my password?" finds "password recovery procedure") and bad at exact matches ("error code E-4417" finds "error handling guide" and misses the actual E-4417 page).

Sparse retrieval — BM25, plain old inverted index — is the opposite: great at exact matches, blind to synonyms.

Production systems do both. Retrieve top N via dense, top N via BM25, and merge. The cleanest merge I have used is Reciprocal Rank Fusion:

def rrf(dense_results, sparse_results, k=60):
    """Merge two ranked lists of chunk IDs with Reciprocal Rank Fusion."""
    scores = {}
    # RRF conventionally uses 1-based ranks: score += 1 / (k + rank).
    for rank, chunk_id in enumerate(dense_results, start=1):
        scores[chunk_id] = scores.get(chunk_id, 0) + 1 / (k + rank)
    for rank, chunk_id in enumerate(sparse_results, start=1):
        scores[chunk_id] = scores.get(chunk_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

Two lines of real work, a massive uplift. Anytime someone tells me their retrieval is "not finding exact matches," the fix is almost always "add BM25."
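Wiring the two sides into rrf is equally unglamorous. A sketch: the vector_store.search interface and the corpus dict are assumptions (use whatever your dense index exposes), and the sparse side uses the rank_bm25 package.

from rank_bm25 import BM25Okapi  # pip install rank-bm25

# corpus maps chunk_id -> chunk_text; vector_store is your dense index.
chunk_ids = list(corpus)
bm25 = BM25Okapi([corpus[cid].lower().split() for cid in chunk_ids])

def hybrid_retrieve(query: str, n: int = 50) -> list[str]:
    # Dense side: assumed interface that returns chunk IDs ranked by similarity.
    dense_results = vector_store.search(query, top_k=n)
    # Sparse side: score every chunk with BM25 and keep the top n IDs.
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(chunk_ids, scores), key=lambda pair: pair[1], reverse=True)
    sparse_results = [cid for cid, _ in ranked[:n]]
    return rrf(dense_results, sparse_results)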

Rerank: The Free Win Nobody Uses

After hybrid retrieval, you have 50-100 candidate chunks. The prompt only has room for maybe 5-10. Which do you pick? The naive answer is "the top 5 from the merged ranking." The better answer is to run a reranker — a cross-encoder that takes the query and each candidate and produces a relevance score.

Reranking is much more accurate than embedding cosine similarity, because it actually reads the chunk in the context of the query rather than comparing two vectors that were produced independently. The cost is per-chunk inference, but only on the 50-100 candidates, not the whole corpus.

Cohere's reranker, Voyage AI's reranker, and the open-source BGE reranker are all reasonable picks. Integrating one is maybe 10 lines of code and typically moves answer quality more than any prompt tuning ever will.
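As a sketch of what those 10 lines look like with the open-source BGE reranker via sentence-transformers (the hosted rerankers have similarly small SDKs; the model name and the candidate dict shape here are illustrative):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    # Score each (query, chunk) pair jointly, then keep the top_k highest-scoring chunks.
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]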

Prompt Construction: Be Boring

Take the top K chunks. Render them in a consistent format. Tell the model to answer using only those chunks. Say what to do if the answer is not there. Do not get clever.

You are answering a user's question using the context below.
Only use facts from the context. If the answer is not in the context,
say you don't know.

<context>
{chunk_1}

{chunk_2}

{chunk_3}
</context>

<question>
{user_question}
</question>

Three things to include that people forget:

  • Source markers. Tag each chunk with an identifier so the model can cite sources, and so you can verify citations programmatically.
  • An explicit fallback. "If the answer is not in the context, say you don't know." Without this, the model will confidently hallucinate from its pretraining.
  • A scope boundary. "Only use facts from the context." Without this, "what year was X released?" triggers the model to fill in gaps from its own knowledge.
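All three fit in a small prompt builder. A sketch, assuming chunks arrive as dicts with id and text fields; the [id] marker format is an assumption, not a requirement.

def build_prompt(question: str, chunks: list[dict]) -> str:
    # Source marker per chunk so the model can cite and we can verify later.
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "You are answering a user's question using the context below.\n"
        "Only use facts from the context. Cite the [id] of each chunk you rely on.\n"
        "If the answer is not in the context, say you don't know.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"<question>\n{question}\n</question>"
    )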

The Evaluation Layer Is Not Optional

Most RAG systems I audit have no evaluation. The team knows "it works most of the time" but cannot tell you which changes made it better or worse.

You need three evaluation flavors:

Retrieval evaluation. Given a set of question+correct-chunk pairs, does your retrieval return the correct chunk in the top K? Measure recall@K. Aim for >90% before worrying about anything else.

End-to-end evaluation. Given a set of question+expected-answer pairs, does the full pipeline produce a correct answer? Measure with an LLM-as-judge against a reference answer. This catches failure modes that retrieval evaluation misses — correct chunk retrieved but poorly synthesized.

Regression evaluation. Run the full eval suite every time a prompt, model, or retrieval component changes. A 2% regression should fail the PR.

An initial eval set is 50-100 question+answer pairs you write by hand. It takes an engineer a day, and it repays the time many times over the project's life. "We cannot release this — eval regression" is a conversation that improves the system; "we think it's better, we're not sure" is one that does not.
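The retrieval half of that suite is small enough to show in full. A sketch: the eval_set item shape and the retrieve callable are assumptions about your own pipeline.

def recall_at_k(eval_set: list[dict], retrieve, k: int = 10) -> float:
    # Each item is assumed to look like {"question": ..., "correct_chunk_id": ...}.
    hits = 0
    for item in eval_set:
        retrieved_ids = retrieve(item["question"])[:k]
        hits += item["correct_chunk_id"] in retrieved_ids
    return hits / len(eval_set)

# Run on every retrieval change; a drop below the baseline should fail the PR.
# print(f"recall@10 = {recall_at_k(eval_set, hybrid_retrieve):.2%}")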

Chunk Metadata: Do It Now

Every chunk in your index should carry structured metadata: document type, product, section, last-updated date, permission scope. The reason is not retrieval — the reason is filtering.

You will eventually want to answer questions like:

  • "Only search product X's docs."
  • "Only search things updated in the last 6 months."
  • "Only search documents this user has permission to see."

You cannot add that metadata later without reindexing. Put the fields in from day one even if you are not filtering yet.
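The record itself does not need to be fancy. The fields below simply mirror the list above; treat them as illustrative, not a schema recommendation.

from dataclasses import dataclass
from datetime import date

@dataclass
class ChunkMetadata:
    doc_type: str          # "guide", "api-reference", "faq", ...
    product: str           # which product the document describes
    section: str           # heading path, the same breadcrumbs as the chunk header
    last_updated: date     # enables "updated in the last 6 months" filters
    permission_scope: str  # group or role allowed to retrieve this chunk

Most vector stores accept a metadata dict alongside each vector and can filter on it at query time, so attaching this record at ingestion is usually all the plumbing required.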

Caching Answers

For RAG systems that serve repeated questions — internal help desks, documentation search — semantic answer caching is a large cost and latency win. Hash the embedding of the question (rounded to a bucket), and if a recent cached answer exists, return it.

Two gotchas: invalidate when source documents change, and only cache answers where the retrieved chunks had high reranker scores. Caching an "I don't know" answer is worse than not caching at all.
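A rough version of the bucket idea, folding in both gotchas. The embed call, the rounding precision, and the score threshold are all assumptions to tune.

import hashlib

cache: dict[str, str] = {}  # bucket key -> cached answer

def cache_key(question: str, index_version: str, precision: int = 1) -> str:
    # Round the question embedding so near-identical phrasings share a bucket.
    vector = embed(question)  # your embedding call
    rounded = ",".join(f"{x:.{precision}f}" for x in vector)
    # Fold in the index version so a reindex invalidates every cached answer.
    return hashlib.sha256(f"{index_version}|{rounded}".encode()).hexdigest()

def lookup(question: str, index_version: str) -> str | None:
    return cache.get(cache_key(question, index_version))

def store(question: str, index_version: str, answer: str, top_rerank_score: float) -> None:
    # Only cache confident answers; never cache an "I don't know".
    if top_rerank_score >= 0.7:
        cache[cache_key(question, index_version)] = answer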

The Failure Modes Nobody Mentions at Conferences

Retrieval is biased toward longer chunks. Shorter chunks have less surface area for keywords and embeddings, so they under-retrieve. If you have a mix of long and short source docs, you need to normalize or your FAQ entries will lose to your 20-page whitepapers.

Embedding models drift. The model you used to embed last year's corpus may not match the one you want to use for new queries. Pick an embedding model you are willing to commit to, and reindex when you change it.

Freshness is not free. If documents change, the index has to be updated. Streaming ingestion is more complex than batch. A nightly rebuild is the simpler pattern and usually enough; do not build streaming ingestion until you know you need it.

Access control matters. Users should only retrieve chunks they are allowed to see. Do this at query time via metadata filters — not after, because a retrieval that returns 10 chunks, filters 8, and sends 2 to the LLM is already a data leak risk.

Hallucinated citations. The model will sometimes cite chunks that exist but do not support the claim. Verify citations programmatically — parse the citation, look up the chunk, confirm the answer is actually in it. This is cheap and catches a real class of subtle errors.
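The verification step can be mechanical. A sketch, assuming the prompt asked for [chunk_id] markers and that a cheap lexical-overlap test is an acceptable first pass; a stricter check would use an NLI model or a second LLM call.

import re

def suspect_citations(answer: str, chunks: dict[str, str], min_overlap: float = 0.3) -> list[str]:
    # Return cited chunk IDs that either were never retrieved or barely overlap the answer.
    answer_words = set(answer.lower().split())
    flagged = []
    for cited_id in re.findall(r"\[([\w-]+)\]", answer):
        chunk = chunks.get(cited_id)
        if chunk is None:
            flagged.append(cited_id)  # cites a chunk that was not in the prompt
            continue
        chunk_words = set(chunk.lower().split())
        overlap = len(answer_words & chunk_words) / max(len(answer_words), 1)
        if overlap < min_overlap:
            flagged.append(cited_id)  # cited chunk shares almost no wording with the answer
    return flagged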

What Production-Grade Looks Like

When I walk into a mature RAG system, the signals I look for:

  • Chunking respects document structure
  • Hybrid retrieval (dense + sparse), RRF merge
  • A reranker between retrieval and generation
  • Metadata on every chunk, used for filtering
  • An evaluation suite with retrieval and end-to-end metrics
  • Citations that the answer actually supports, verified
  • Caching where it makes sense, with invalidation

A system with six of these seven is serving real users. A system with two of them is a demo that found its way into prod, and the team is in firefighting mode.

You do not need to build all seven on day one. You do need to know which ones you are skipping and why.
