June 21, 2026 · 7 min read · Rishi

Chunking Strategies for RAG: Where Retrieval Quality Is Won or Lost

When a retrieval-augmented generation system gives wrong or vague answers, the instinct is to blame the model or the embedding. So teams swap in a bigger embedding model, add a reranker, tune the prompt — and the answers barely improve. The real culprit is usually upstream of all of that, in the least glamorous step of the pipeline: how the documents were split into chunks.

Chunking decides what units of text can ever be retrieved. If the right answer is spread across two chunks, or buried in a chunk full of unrelated text, no embedding model or reranker can save you — the information your system needs simply isn't retrievable as a clean unit. Chunking is where retrieval quality is silently won or lost, and it deserves far more attention than it usually gets.

Why you chunk at all

You can't embed an entire 80-page document as one vector and expect useful retrieval. Embeddings compress meaning into a fixed-size vector, and the more text you cram in, the more the specific details blur into an averaged-out mush. A query about one precise fact won't match a vector that represents the "average meaning" of eighty pages.

So you split documents into smaller pieces, embed each, and retrieve the pieces most similar to the query. The chunk is the atom of retrieval. That makes the chunking decision foundational: a chunk is the smallest thing your system can find, and the largest thing it has to find it in. Too big and the signal is diluted; too small and you lose the context that makes the text meaningful.

The naive baseline and why it fails

The simplest approach: split every N characters. It's trivial to implement and almost always wrong, because it slices with no regard for meaning:

def fixed_chunks(text, size=1000):
    # Cut every `size` characters, ignoring sentence and paragraph boundaries.
    return [text[i:i+size] for i in range(0, len(text), size)]

This cuts sentences in half, splits a table from its header, and severs a "therefore" from the reasoning it concludes. A chunk that ends mid-sentence embeds poorly because it's semantically incomplete — the vector represents a fragment, not an idea. You'll retrieve a chunk that starts relevant and trails off into the next topic, or one that begins mid-thought with no anchor.

The first real improvement is to respect structure. Split on natural boundaries (paragraphs, then sentences) and only fall back to hard character limits when a single unit is too big. This is what "recursive character splitting" does: try to split on "\n\n" (paragraphs), then "\n" (lines), then ". " (sentences), then individual characters, preferring the most semantically meaningful boundary that fits your size budget. It's almost always an upgrade over fixed-size splitting, for almost no extra effort.
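Here is a minimal sketch of that fallback logic, assuming the separator order described above. The name recursive_split and its parameters are illustrative, not taken from any particular library, and real implementations differ in the details.

```python
def recursive_split(text, size=1000, separators=("\n\n", "\n", ". ", "")):
    """Split text into pieces of at most `size` characters.

    Tries the coarsest separator first and only recurses to a finer one
    for pieces that are still too big. Illustrative sketch only.
    """
    if len(text) <= size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard character cut.
        return [text[i:i + size] for i in range(0, len(text), size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    chunks, current = [], ""
    for i, part in enumerate(parts):
        # Re-attach the separator to every part except the last one,
        # so no characters are lost.
        piece = part + sep if i < len(parts) - 1 else part
        if len(piece) > size:
            # This unit alone is over budget: flush what we have,
            # then split the unit with the finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, size, finer))
        elif len(current) + len(piece) <= size:
            # Still fits: keep packing units into the current chunk.
            current += piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

The sketch keeps every character, so joining the chunks reproduces the input. That also means a chunk can occasionally be whitespace only; in practice you would probably strip chunks and drop the empty ones before embedding.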

Overlap: the cheap fix for boundary loss

Even with smart boundaries, information near a chunk edge gets orphaned. A definition at the end of one chunk and its use at the start of the next are now in separate vectors, and a query needing both might match neither well.

The standard mitigation is overlap: each chunk repeats the last bit of the previous one.

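def overlapping_chunks(text, size=1000, overlap=150):
    # overlap must be smaller than size, or start would never advance.
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk reached the end; skip a redundant tail
        start += size - overlap
    return chunks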

A 10–20% overlap means a fact sitting on a boundary appears intact in at least one chunk, provided the fact is shorter than the overlap. The cost is some duplicated text and a slightly larger index. It's one of the highest-return, lowest-effort knobs in RAG, so turn it on. The right amount depends on content; dense technical text benefits from more overlap than loose prose.
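To put a rough number on "slightly larger": each new chunk advances by size minus overlap characters, so for long texts the chunk count grows by about size / (size - overlap). The helper below is a back-of-the-envelope estimate under that assumption, and index_growth is a hypothetical name.

```python
def index_growth(size, overlap):
    """Approximate factor by which overlap inflates the number of chunks.

    Assumes a long text and a fixed stride of (size - overlap) characters.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    return size / (size - overlap)


# Growth factor for 10%, 15%, and 20% overlap on 1000-character chunks.
for overlap in (100, 150, 200):
    print(overlap, round(index_growth(1000, overlap), 3))
```

So a 15% overlap costs roughly 18% more vectors, and 20% costs 25% more.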

Match the strategy to the document

There is no universal chunk size, because "the right unit of meaning" depends entirely on what you're chunking. This is the judgment that separates a tuned RAG system from a generic one:

Structured prose (docs, articles, wikis). Chunk by heading and section. The document's own structure is the semantic boundary — the author already grouped related ideas under headings. Carry the heading hierarchy into each chunk so "this paragraph" knows it's under "Section 3.2: Refund Policy."
Code. Split on function and class boundaries, not line counts. Half a function is nearly useless context. Tree-sitter or a language-aware splitter beats character counting by a mile here.
Tables and spreadsheets. Keep rows with their headers. A row of numbers with no column names is noise. Sometimes the right move is to convert each row to a sentence ("In Q3, the Northeast region had revenue of...") so it embeds with meaning.
Q&A, FAQs, transcripts. Keep each question with its answer, each speaker turn intact. The natural unit is the exchange, not an arbitrary span.
Long-form reasoning (legal, scientific). Larger chunks, because the meaning depends on extended context. Splitting an argument from its premises destroys it.
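As one illustration of the structured-prose case, here is a sketch that splits on headings and carries the heading hierarchy into each chunk. It assumes Markdown-style "#" headings, which is my assumption rather than something every document gives you, and chunk_by_heading is a hypothetical helper.

```python
import re

# Assumption: headings look like "# Title", "## Subtitle", up to six levels.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(markdown_text):
    """Return one chunk per section, prefixed with its heading breadcrumb."""
    chunks = []
    path = []  # stack of (level, title) for the headings currently in scope
    body = []  # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            titles = [title for _, title in path]
            breadcrumb = " > ".join(titles)
            full_text = f"{breadcrumb}\n{text}" if breadcrumb else text
            chunks.append({"path": titles, "text": full_text})
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any section at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

A section under "## Refund Policy" inside "# Policies" comes out with the first line "Policies > Refund Policy", so the chunk still says where it lives. One known limitation of this sketch: a "#" line inside a code block would be mistaken for a heading.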

The meta-point: let the document's natural structure define your chunks whenever it can. Authors already did the work of grouping related ideas — headings, functions, rows, turns. Fighting that structure with blind character counts throws away free signal.
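The same principle applies to code. As a sketch for Python source only, the standard-library ast module can split a file on top-level function and class boundaries. A multi-language pipeline would need something like a language-aware parser instead, and split_python_source is an illustrative name.

```python
import ast


def split_python_source(source):
    """Return one chunk per top-level function or class in a Python file.

    Sketch only: module-level statements (imports, constants) are skipped,
    and decorators are not included in the extracted text.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                # Exact source text spanned by this definition.
                "text": ast.get_source_segment(source, node),
            })
    return chunks
```

Each chunk is a whole definition, so the model never sees half a function. Very large classes may still need a second pass that splits them by method.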

Decouple what you embed from what you return

A powerful and underused idea: the chunk you search over doesn't have to be the chunk you give the model.

Small-to-big. Embed small, precise chunks (a sentence or two) for sharp retrieval, but when one matches, return the larger surrounding section to the LLM so it has full context. You get the precision of small chunks and the completeness of large ones.
Summary/hypothetical-question indexing. For each chunk, generate a short summary or a few questions it could answer, embed those, and link back to the original chunk. The query "what's the refund window?" matches the generated question "How long do customers have to request a refund?" far better than it matches dense policy prose.
Contextual chunks. Prepend a one-line description of the chunk's place in the document ("From the 2025 employee handbook, section on parental leave:") before embedding. This restores the context that chunking stripped away, so an isolated chunk doesn't lose its meaning. This technique alone can meaningfully cut retrieval failures.
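To make small-to-big concrete, here is a toy sketch. It indexes individual sentences but returns the whole parent section. Word overlap stands in for embedding similarity purely so the example runs without a model; in a real system you would swap in your embedding search. All names here are hypothetical.

```python
import re


def tokenize(text):
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(sections):
    """Index each sentence separately, remembering its parent section."""
    index = []
    for section_id, section in enumerate(sections):
        for sentence in re.split(r"(?<=[.!?])\s+", section.strip()):
            if sentence:
                index.append((tokenize(sentence), section_id))
    return index


def retrieve_small_to_big(query, sections, index, k=1):
    """Match against sentences, return up to k distinct parent sections."""
    query_tokens = tokenize(query)
    # Rank the small units by how many query words they share.
    ranked = sorted(index, key=lambda entry: len(query_tokens & entry[0]), reverse=True)
    seen, results = set(), []
    for _, section_id in ranked:
        if len(results) >= k:
            break
        if section_id not in seen:
            seen.add(section_id)
            results.append(sections[section_id])
    return results
```

The search happens over sentences, but what comes back is the full section, so the model gets the surrounding context along with the sentence that matched.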

These tricks exploit the same insight: optimize the embedded representation for matching and the returned representation for answering. They're different jobs, and forcing one chunk to do both is a compromise you don't have to make.
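One way to keep the two jobs separate in code is to store them as two fields. The sketch below uses the contextual-chunk idea as the example; IndexedChunk and contextualize are hypothetical names, and the prefix wording is just one possible template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    embed_text: str   # what gets embedded and searched
    return_text: str  # what gets handed to the model on a match


def contextualize(chunk_text, doc_title, section):
    """Prepend a one-line description of the chunk's origin before embedding."""
    prefix = f"From {doc_title}, section on {section}:"
    return IndexedChunk(
        embed_text=f"{prefix}\n{chunk_text}",
        return_text=chunk_text,
    )
```

Whether the returned text should also carry the prefix is a design choice; keeping the fields separate lets you try both without re-chunking.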

How to actually tune it

Chunking is empirical, not theoretical: you can't reason your way to the right strategy; you have to measure it. The mistake is shipping a chunk size you picked by gut and never revisiting it.

Build a small evaluation set of real questions paired with the document passages that should answer them. Even 30–50 examples are enough to see signal.
Measure retrieval directly. Before judging the final answer, check: did the right chunk make it into the top-k retrieved results? If the answer isn't even being retrieved, no amount of prompt engineering downstream will fix it. Top-k hit rate is the metric to watch first.
Sweep the knobs — size, overlap, structure-aware vs. fixed, small-to-big — against that eval set and compare retrieval hit rate.
Then look at end-to-end answer quality, having ruled out retrieval as the bottleneck.
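The retrieval check and the sweep can be sketched in a few lines. This assumes your eval set is a list of (question, expected_passage) pairs and that you can supply a retrieve(question, k) function returning chunk texts. Both of those, along with build_retriever, are placeholders for whatever your pipeline provides.

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose expected passage appears in a top-k chunk."""
    if not eval_set:
        return 0.0
    hits = 0
    for question, expected_passage in eval_set:
        top_chunks = retrieve(question, k)
        # A passage split across two chunks counts as a miss, which is
        # exactly the chunking failure this metric should expose.
        if any(expected_passage in chunk for chunk in top_chunks):
            hits += 1
    return hits / len(eval_set)


def sweep(eval_set, build_retriever, configs, k=5):
    """Compare chunking configurations by retrieval hit rate.

    `build_retriever(**params)` is a hypothetical factory that chunks and
    indexes your corpus with the given settings and returns a `retrieve`
    function.
    """
    return {
        name: hit_rate_at_k(eval_set, build_retriever(**params), k)
        for name, params in configs.items()
    }
```

A call might look like sweep(eval_set, build_retriever, {"size500": {"size": 500, "overlap": 75}, "size1000": {"size": 1000, "overlap": 150}}), where the configuration names and values are only examples. Exact substring matching is strict; if your chunks are normalized or rewritten before indexing, you may need a looser match.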

Most RAG debugging should start at retrieval, and most retrieval problems are chunking problems. Before you reach for a bigger model or a fancier reranker, ask the unglamorous question: are my chunks even the right shape for the questions people ask? More often than not, fixing the chunks is the single highest-leverage change you can make — and it's the one teams skip because splitting text feels too simple to be the thing that matters. It is exactly the thing that matters.
