Paste a conversation. See exactly what fits in 8K / 32K / 128K / 200K windows, and which truncation strategy preserves the most context.
Context window fit
Message stack (sized by tokens)
system user assistant
Truncation simulation
Context Window Visualizer takes a multi-turn conversation and shows you exactly how it fills different context windows. Each message is sized proportionally by its token count and color-coded by role. Four truncation strategies are simulated, showing which messages survive when the conversation exceeds a given window size. The goal is to make context window management a visual exercise instead of guesswork.
Input formats accepted:
Plain text:
SYSTEM: You are a helpful assistant.
USER: What's the capital of France?
ASSISTANT: Paris.
USER: And of Germany?
JSON (ChatML):
[
{ "role": "system", "content": "You are..." },
{ "role": "user", "content": "What's..." },
{ "role": "assistant", "content": "Paris." }
]
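For programmatic use, the plain-text format can be converted into the JSON message list with a short parser. A minimal sketch, assuming exactly the SYSTEM:/USER:/ASSISTANT: prefixes shown above:

```python
import re

ROLE_PREFIX = re.compile(r"^(SYSTEM|USER|ASSISTANT):\s*", re.MULTILINE)

def parse_transcript(text):
    """Split a role-prefixed transcript into a list of message dicts."""
    parts = ROLE_PREFIX.split(text)
    # re.split with a capture group yields ['', 'SYSTEM', content, 'USER', content, ...]
    return [
        {"role": role.lower(), "content": content.strip()}
        for role, content in zip(parts[1::2], parts[2::2])
    ]
```

Multi-line message bodies work too, since the regex only matches role prefixes at the start of a line.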
| Strategy | What gets cut | Best for |
|---|---|---|
| Drop oldest | Earliest messages first (after system prompt) | Simple chatbots where recency dominates |
| Summarize prefix | Replace early turns with a compressed summary | Long agent loops with consistent goals |
| Sliding window (keep last N) | Everything except last N turns + system prompt | Conversation where only recent context matters |
| Keep first + last | Middle turns; preserves early examples + recent history | Few-shot prompts in early turns |
Each strategy shows the resulting token total and which messages were kept vs. dropped. There’s no universally correct strategy — the right choice depends on what information is most important to preserve.
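As a concrete illustration, here is a minimal sketch of the drop-oldest strategy. The token counter is a rough stand-in (about 4 characters per token); a real implementation would use the model's tokenizer:

```python
def count_tokens(message):
    # Rough estimate: ~4 characters per token. Swap in the model's
    # actual tokenizer for accurate counts.
    return max(1, len(message["content"]) // 4)

def drop_oldest(messages, budget):
    """Keep the system prompt, then drop the earliest turns until
    the remaining messages fit within `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(count_tokens(m) for m in system)
    kept = []
    # Walk backwards so the newest turns are kept first.
    for m in reversed(rest):
        cost = count_tokens(m)
        if used + cost > budget:
            break
        kept.append(m)
        used += cost
    return system + list(reversed(kept))
```

Sliding window is the same loop with a message-count cap instead of a token budget; both preserve the system prompt unconditionally.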
This is one of the most consistently underestimated challenges in long-running LLM applications. Here’s the actual scale of the problem with real numbers:
200K tokens sounds enormous. Claude 3.5 Sonnet has a 200K context window, which sounds like room for everything. But consider a real long-running agent loop: each turn carries the user message, the assistant's reasoning, tool calls, and tool results, easily ~1,200 tokens per turn.
At that rate, 200K tokens = approximately 166 turns. A complex multi-step task that runs 200 agent turns will hit the context limit. And that's before you consider that many real agent loops include much heavier messages: file contents, API responses, search results, code outputs.
In practice, most production agents start hitting context pressure around turn 50–100. The question is not “will I hit the limit?” but “what happens when I do, and how do I design for it?”
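The turn arithmetic above is simple to sketch:

```python
def turns_until_limit(window_tokens, tokens_per_turn, system_tokens=0):
    """How many whole turns fit before the conversation overflows the window."""
    return (window_tokens - system_tokens) // tokens_per_turn

# Using ~1,200 tokens per agent turn, as in the text:
turns_until_limit(200_000, 1_200)  # → 166
```

Adding a system prompt, file attachments, or retrieved chunks only shrinks this number, which is why context pressure arrives well before the nominal window size suggests.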
When you hit the context limit, something has to be dropped. The default behavior in most frameworks and APIs is “drop oldest messages first.” This default is frequently wrong, and understanding why matters for agent design.
What drop-oldest destroys: the early turns that typically carry the highest-value, longest-lived information, such as the original task specification, few-shot examples, and facts established at the start of the conversation.
When these are dropped, the agent starts generating responses that drift from its original behavior, lose track of established facts, and eventually confabulate. This looks like model degradation or prompt sensitivity, but it's actually a context management failure.
Summarize-prefix is the best general approach for agents with consistent long-term goals. The strategy: when context exceeds the limit, call the model one more time with the early turns and the instruction “summarize the progress so far, key decisions made, and current state in 500 tokens.” Replace those early turns with the summary. The agent retains a compressed version of its history and continues with context to spare.
The downside: summarization is a separate LLM call (latency + cost). For latency-sensitive applications, sliding window or keep-first+last may be acceptable tradeoffs.
Keep-first+last is the right strategy when early turns contain high-value, long-lived information (detailed instructions, few-shot examples, reference data) and recent turns contain the active task state. Middle turns — often transitional back-and-forth — are safely dropped.
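Keep-first+last can be sketched the same way; `first_n` and the character-based token estimate are illustrative assumptions:

```python
def keep_first_and_last(messages, budget, first_n):
    """Keep the system prompt, the first `first_n` turns (instructions,
    few-shot examples), and as many recent turns as fit in `budget`."""
    def cost(msgs):
        # Rough estimate: ~4 characters per token.
        return sum(max(1, len(m["content"]) // 4) for m in msgs)
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    head, tail = rest[:first_n], rest[first_n:]
    used = cost(system) + cost(head)
    kept_tail = []
    # Fill the remaining budget from the newest turns backwards,
    # so the dropped turns are the transitional middle.
    for m in reversed(tail):
        if used + cost([m]) > budget:
            break
        kept_tail.append(m)
        used += cost([m])
    return system + head + list(reversed(kept_tail))
```

Note the head is kept unconditionally here; a production version should verify that system + head alone fit within the budget.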
| Content type | Tokens per unit | 128K window fills in |
|---|---|---|
| Average chatbot exchange (user+assistant) | ~200 tokens | 640 exchanges |
| Agent loop with tool calls | ~1,200 tokens/turn | ~107 turns |
| Agent with file attachment (~5 pages) | ~4,000 tokens/turn | ~32 turns |
| Agent with code generation + review | ~3,000 tokens/turn | ~43 turns |
| RAG with top-3 retrieved chunks (512 tokens each) | ~2,000 tokens per query | ~64 queries |
128K context is generous for interactive chat. For agents doing real work — reading files, generating code, making tool calls — 128K fills in 30–100 turns. Plan for context pressure from the design phase, not as an afterthought.
Tool-role messages are counted as assistant messages for token accounting purposes.