See exactly how an LLM splits your text into tokens — color-coded, byte-counted, model-aware
Tokenizer Visualizer shows you how LLMs actually split your text into tokens — the unit models count, bill by, and process. Paste any text and see color-coded token chips across four model families: OpenAI, Anthropic, Meta Llama, and Google Gemini. Each chip shows token content, byte count, and character count. A summary panel shows total token counts per family side by side.
The four tokenizer families covered
| Family | Models |
|---|---|
| OpenAI cl100k / o200k | GPT-4, GPT-4o, GPT-4-turbo, o1, o3 |
| Anthropic | Claude 3, 3.5, 3.7, 4 (Sonnet/Opus/Haiku) |
| Meta Llama | Llama 3, 3.1, 3.2, 3.3 |
| Google Gemini | Gemini 1.5 Pro/Flash, 2.0 Flash |
This is a BPE pre-tokenization visualizer. Real tokenization is a two-step process: (1) pre-tokenization splits text into candidate token boundaries using a regex pattern, and (2) BPE merging repeatedly combines adjacent fragments according to the learned merge rules in the vocabulary. This tool runs only step 1 — the regex pre-split — which matches real token boundaries for roughly 95% of English prose.
The cl100k_base regex pattern used as the pre-tokenizer:
(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+
This pattern captures contractions, word tokens, numeric tokens (up to 3 digits at a time), punctuation with optional leading space, and whitespace. The pattern is the same one used internally by tiktoken’s cl100k_base tokenizer — so the split boundaries shown are real, not approximations.
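The pre-split can be reproduced with the standard library. A caveat: Python's built-in re module does not support the \p{L} and \p{N} classes used in the pattern above (the third-party regex module accepts it verbatim), so this sketch substitutes approximate stdlib equivalents — the letter class [^\W\d_] and the digit class \d:

```python
import re

# Stdlib approximation of the cl100k_base pre-tokenizer regex.
# \p{L} is approximated by [^\W\d_] and \p{N} by \d; behavior can
# differ slightly from tiktoken on unusual Unicode input.
PRETOKEN = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"  # common English contractions
    r"|[^\r\n\w]?[^\W\d_]+"          # optional leading symbol + run of letters
    r"|\d{1,3}"                      # numbers, at most 3 digits per fragment
    r"| ?[^\s\w]+[\r\n]*"            # punctuation, with optional leading space
    r"|\s*[\r\n]+"                   # newline runs
    r"|\s+(?!\S)"                    # trailing whitespace
    r"|\s+"                          # any remaining whitespace
)

def pre_tokenize(text: str) -> list[str]:
    """Split text at candidate token boundaries (step 1 only, no BPE merges)."""
    return PRETOKEN.findall(text)

print(pre_tokenize("Hello, world!"))  # ['Hello', ',', ' world', '!']
print(pre_tokenize("It's 2024"))      # ['It', "'s", ' ', '202', '4']
```

Note how a leading space attaches to the following word (" world") and numbers split into at most three digits at a time — both visible in the visualizer's chips.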
Per-family counts are derived from calibrated multipliers applied to the cl100k baseline: Anthropic’s tokenizer is roughly 5% denser; Llama uses SentencePiece BPE with a 128k vocabulary that handles common subword patterns slightly differently; Gemini uses a proprietary BPE tokenizer with SentencePiece-style preprocessing. For exact counts, use the model’s native tokenizer.
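The multiplier approach reduces to a one-line scaling step. The values below are hypothetical placeholders for illustration — the tool's actual calibrated multipliers are not listed in this document:

```python
# Hypothetical multipliers relative to the cl100k baseline (illustrative only).
MULTIPLIERS = {"openai": 1.00, "anthropic": 1.05, "llama": 1.02, "gemini": 0.98}

def estimate_counts(cl100k_count: int) -> dict[str, int]:
    """Scale a cl100k baseline token count into per-family estimates."""
    return {family: round(cl100k_count * m) for family, m in MULTIPLIERS.items()}

print(estimate_counts(1000))
# {'openai': 1000, 'anthropic': 1050, 'llama': 1020, 'gemini': 980}
```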
Understanding tokenization changes how you write prompts. Here are concrete examples:
Token-efficient writing:
- "Summarize:" → 2 tokens
- "Please provide a comprehensive summary of the following text:" → 11 tokens

Token-wasteful patterns:
- https://example.com/v1/api/endpoint?param=value&other=thing is 15+ tokens
- --- dividers, ### headers, and repeated **bold** markers all cost tokens

Token-efficient rewriting:
- "1." is 1 token; "-" is also 1 token, but numbered lists are structurally clearer
- <instruction> is 4 tokens, cleaner than freeform labeling

The context window as a budget: A 200K context window sounds enormous, but consider: a 50-turn conversation exchanging a 500-token user message and a 500-token reply each turn is 50,000 tokens, a quarter of your budget, before any documents are added. Understanding token density helps you make informed decisions about what to include vs. summarize vs. drop.
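The budget arithmetic above can be sketched directly (assuming, as an illustration, that each turn contributes one user message plus one assistant reply):

```python
CONTEXT_WINDOW = 200_000  # tokens

turns = 50
tokens_per_message = 500
# Each turn contributes a user message plus an assistant reply.
conversation_tokens = turns * 2 * tokens_per_message

fraction_used = conversation_tokens / CONTEXT_WINDOW
print(conversation_tokens, f"{fraction_used:.0%}")  # 50000 25%
```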
Non-English text — non-Latin scripts tokenize less efficiently than English. Japanese and Chinese characters often tokenize to multiple bytes per character with BPE vocabularies trained predominantly on English data. A 100-character Chinese sentence may produce 200+ tokens, while a 100-character English sentence produces 25–35. This has real cost and context implications for non-English applications.
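The byte cost behind this is easy to verify: most CJK characters occupy three UTF-8 bytes each, while ASCII letters occupy one, so BPE vocabularies tuned to English see far more raw bytes per character of non-Latin text:

```python
# UTF-8 byte counts per character: CJK is typically 3 bytes, ASCII is 1.
chinese = "\u4f60\u597d\u4e16\u754c"  # 你好世界 — 4 characters
english = "hiya"                      # 4 characters

print(len(chinese), len(chinese.encode("utf-8")))  # 4 12
print(len(english), len(english.encode("utf-8")))  # 4 4
```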
Code — code tokenizes differently from prose. Identifiers tokenize well (get_user, processEvent), but operators, brackets, and indentation tokenize as individual characters or small pairs. A Python function with significant indentation may tokenize 30–40% less efficiently than equivalent prose. Minified code is actually worse — it packs more semantic content into fewer characters but those characters often tokenize at 1:1.
Mathematical notation — LaTeX math expressions are extremely token-inefficient. \frac{\partial f}{\partial x} is 12+ tokens for what’s semantically one derivative symbol. For math-heavy prompts, consider using Unicode math symbols (∂, ∑, ∫) instead of LaTeX — they often tokenize as single tokens.
Special characters and emoji — emoji tokenize as multiple bytes. The thumbs-up emoji 👍 is 4 bytes in UTF-8 and typically tokenizes as 1–2 tokens. For prompts with high emoji density (social media analysis, chat data), this adds up.
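The 4-byte figure is checkable in one line, and emoji sequences (skin-tone variants, flags) add a full extra code point on top:

```python
thumbs_up = "\N{THUMBS UP SIGN}"  # U+1F44D
print(len(thumbs_up.encode("utf-8")))  # 4 bytes

# A skin-tone variant is a two-code-point sequence: base emoji + modifier.
toned = "\N{THUMBS UP SIGN}\N{EMOJI MODIFIER FITZPATRICK TYPE-4}"
print(len(toned.encode("utf-8")))  # 8 bytes
```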
Repeated patterns — BPE merges common patterns into single tokens. the is 1 token, The is 1 token, THE may be 1–2 tokens depending on the vocabulary. Common words in common casing are cheap; rare words or unusual casing are expensive.
This is a pre-tokenization visualizer, not the full BPE merge pipeline. Token boundaries shown are accurate to ~95% for English prose, ±10% for total count, ±15–20% for code or non-Latin scripts.
For exact counts, use:
- the tiktoken library, or the Tokenizer at platform.openai.com
- the anthropic.count_tokens() API method
- the transformers tokenizer for the specific model checkpoint

Special tokens (<|im_start|>, <|endoftext|>, BOS/EOS) are not rendered separately.

For informational purposes only. Not financial, medical, or legal advice. You are solely responsible for how you use these tools.