Visualize exactly how your chunking strategy splits documents — with overlap regions highlighted and per-chunk token estimates
RAG Chunk Inspector lets you visualize exactly how a chunking strategy splits a document before you commit to running it against a real corpus. Paste any text, pick one of three strategies, set chunk size and overlap, and see every resulting chunk rendered separately with token estimates, character counts, and overlap zones highlighted in color. No document leaves your browser.
Fixed-size is the naïve approach: cut every N characters with M characters of overlap. Fast and deterministic, but it ignores semantic boundaries entirely, splitting sentences and even words mid-way. Use it as a baseline to see how much the alternatives buy you; the gap is often larger than you expect.
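The fixed-size strategy above is a few lines of code, which is much of its appeal. A minimal sketch (the function name is my own, not part of any framework):

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut every chunk_size characters, stepping back overlap characters
    between consecutive chunks. Ignores all semantic boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    # Trailing chunks may be shorter than chunk_size (or pure overlap);
    # production code often drops or merges those.
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Note that with overlap, the last `overlap` characters of each chunk reappear at the start of the next one, which is exactly the duplicated-text cost discussed below.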
Recursive splitting is the standard approach used by LangChain, LlamaIndex, and most RAG frameworks. It tries to split on paragraph breaks (\n\n) first; if a chunk is still too big, it falls back to sentence boundaries (. ! ?), and finally to word boundaries. It preserves semantic units when possible, and it is the right default for general-purpose prose.
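The recursive fallback described above can be sketched as follows. This is a simplified illustration in the spirit of framework splitters, not any library's actual implementation; the function name and separator list are assumptions:

```python
def recursive_split(text, chunk_size, separators=("\n\n", ". ", " ")):
    """Split on the strongest separator first, recursing into weaker
    separators only for pieces that are still too large."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No boundaries left: hard cut, exactly like fixed-size splitting.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, weaker = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate          # greedily grow the current chunk
            continue
        if current:
            chunks.append(current)       # flush before overflowing
            current = ""
        if len(piece) > chunk_size:
            # A single paragraph can exceed chunk_size: recurse with
            # the next-weaker separator (sentences, then words).
            chunks.extend(recursive_split(piece, chunk_size, weaker))
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Paragraphs that fit are kept whole; only oversized paragraphs get broken at sentence and then word boundaries, which is why the output tracks semantic units so much better than a fixed-size cut.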
Sentence-aware splitting breaks the text into sentences first, then groups sentences greedily until each group is just under the chunk size. It preserves sentence integrity 100% of the time but produces variable-size chunks. Good for dialogue transcripts, FAQ content, and any corpus where sentence boundaries are high-signal.
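The greedy sentence-grouping step can be sketched like this (the regex sentence splitter is deliberately naive; a real tokenizer such as NLTK's punkt handles abbreviations like "Dr." far better):

```python
import re

def sentence_chunks(text: str, chunk_size: int) -> list[str]:
    """Group whole sentences greedily up to chunk_size characters.
    Sentences are never split, so chunks can exceed chunk_size when a
    single sentence does."""
    # Naive split: a ., !, or ? followed by whitespace ends a sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        candidate = (current + " " + s).strip()
        if len(candidate) > chunk_size and current:
            chunks.append(current)   # flush: adding s would overflow
            current = s
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The variable-size behaviour mentioned above is visible here: an 18-character sentence against a 15-character budget still comes out as one intact chunk rather than being split.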
Chunk size is the most impactful tunable parameter in a RAG pipeline, and it has no universally correct value. The tradeoff:

- Too small (e.g., 128 tokens): each chunk carries too little context to stand on its own, so retrieval returns fragments the LLM can't ground an answer in, and the index fills with many near-duplicate vectors.
- Too large (e.g., 2048 tokens): each embedding becomes a blurry average of several concepts, retrieval precision drops, and every hit drags hundreds of irrelevant tokens into the LLM's context window.
The empirically useful starting range is 256–1024 tokens. Most production RAG systems live in this range, with 512 tokens being the most common starting point.
Chunking creates artificial boundaries in continuous text. A fact that spans the boundary between chunk 47 and chunk 48 is invisible to retrieval: chunk 47 ends mid-sentence and chunk 48 starts without context. The retrieval system will never return a relevant answer for a query about that fact unless it lands solidly inside a single chunk.
Overlap is the mitigation: each chunk includes N tokens from the end of the previous chunk. This ensures that boundary-spanning content appears fully in at least one chunk. The cost is redundant storage and embedding computation for the overlapping portions.
A 10% overlap (50 tokens of overlap on a 512-token chunk) is the practical minimum. Going above 25% produces diminishing returns — you’re mostly duplicating context that the embedding model will learn to treat as background.
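The overlap arithmetic above can be captured in two small helpers. Both names are my own, and the 4-characters-per-token ratio is a rough heuristic for English prose, not a tokenizer:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    # Use a real tokenizer (e.g. tiktoken) for exact counts.
    return max(1, len(text) // 4)

def overlap_waste(chunk_tokens: int, overlap_tokens: int) -> float:
    # Fraction of stored and embedded text that is duplicated across
    # neighbouring chunks (asymptotically, over a long document).
    return overlap_tokens / chunk_tokens

# The practical minimum from the text: 50 tokens on a 512-token chunk.
print(f"{overlap_waste(512, 50):.1%}")  # prints "9.8%"
```

This is where the tool's "overlap waste" figure comes from: at 10% overlap you pay roughly 10% extra storage and embedding cost to cover boundary-spanning facts.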
Embedding models have an optimal input length — a sweet spot where the input is long enough to have rich semantic content but not so long that the single embedding vector is a blurry average of too many concepts. For most production embedding models (OpenAI text-embedding-3, Cohere embed-v3, BGE), this optimal range is roughly 256–512 tokens.
A chunk that perfectly captures one coherent thought — one section of a document, one FAQ question + answer, one code function with its docstring — will embed closer to queries that ask about that thought, and farther from queries that don’t. That’s what “good retrieval” means at the embedding layer.
This is why recursive splitting, which tries to preserve paragraph and sentence boundaries, tends to outperform fixed-size character splitting: it produces chunks that align with natural semantic boundaries, which align better with how the embedding model learned to represent meaning.
| Use case | Strategy | Chunk size | Overlap |
|---|---|---|---|
| General prose (docs, articles) | Recursive | 512 tokens | 50 tokens (10%) |
| Code documentation | Recursive | 1024 tokens | 100 tokens |
| Long-form articles | Sentence-aware | 768 tokens | 1 sentence |
| Tabular / structured data | Fixed-size | 256 tokens | 0 |
| Conversational logs | Recursive on speaker turns | 512 tokens | 1 turn |
| Legal / medical text | Recursive | 512–768 tokens | 75 tokens |
These are starting points. Paste your actual data and tune from there.
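If you encode these starting points in code, a plain mapping keeps them easy to override per corpus. This is a hypothetical sketch mirroring the token-valued rows of the table above (names and structure are my own):

```python
# Starting-point presets; tune against your actual data.
CHUNKING_PRESETS = {
    "general_prose": {"strategy": "recursive",  "chunk_tokens": 512,  "overlap_tokens": 50},
    "code_docs":     {"strategy": "recursive",  "chunk_tokens": 1024, "overlap_tokens": 100},
    "tabular":       {"strategy": "fixed_size", "chunk_tokens": 256,  "overlap_tokens": 0},
    "legal_medical": {"strategy": "recursive",  "chunk_tokens": 512,  "overlap_tokens": 75},
}
```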
This tool sits in the pre-indexing stage of your pipeline:
```
Document ingestion
        ↓
[RAG Chunk Inspector]  ← tune here
        ↓
Chunk embedding (text-embedding-3, etc.)
        ↓
Vector store indexing (Chroma, Pinecone, Weaviate, pgvector)
        ↓
Query → embedding → ANN search → top-k chunks → LLM synthesis
```
The chunking strategy and size you pick here directly determine the shape of your embedding space and the quality of retrieval at query time. Getting this wrong doesn't produce an error; it produces subtle quality degradation that's hard to diagnose post hoc. The correct time to tune it is here, before indexing.