See exactly how an LLM splits your text into tokens - color-coded, byte-counted, model-aware
As-is, no warranty. These apps are free under their listed license and run entirely in your browser. Use at your own risk — don't blame me if your PC catches fire, your dog runs away, or the math turns out wrong. Verify anything that actually matters. None of this is professional financial, medical, legal, or engineering advice.
Tokenizer Visualizer shows you how LLMs actually split your text into tokens - the unit models count, bill by, and process. Paste any text and see color-coded token chips across four model families: OpenAI, Anthropic, Meta Llama, and Google Gemini. Each chip shows token content, byte count, and character count. A summary panel shows total token counts per family side by side.
The four tokenizer families covered
| Family | Models |
|---|---|
| OpenAI cl100k / o200k | GPT-4, GPT-4o, GPT-4-turbo, o1, o3 |
| Anthropic | Claude 3, 3.5, 3.7, 4 (Sonnet/Opus/Haiku) |
| Meta Llama | Llama 3, 3.1, 3.2, 3.3 |
| Google Gemini | Gemini 1.5 Pro/Flash, 2.0 Flash |
This is a BPE pre-tokenization visualizer. Real tokenization is a two-step process: (1) pre-tokenization splits text into chunks using a regex pattern, and (2) within each chunk, BPE repeatedly merges adjacent fragments according to the learned merge rules until the chunk is expressed in vocabulary tokens. This tool runs step 1 only - the regex pre-split - so its token boundaries are accurate to ~95% for English prose.
The cl100k_base regex pattern used as the pre-tokenizer:
(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+
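This pattern captures contractions, word tokens (with an optional leading space or symbol), numeric tokens (up to 3 digits at a time), punctuation with optional leading space, and whitespace. It is the same pattern used internally by tiktoken’s cl100k_base tokenizer, so every split shown is a real boundary: no token ever spans two chunks. What the pre-split cannot show is where BPE further divides a chunk, such as a rare word, into several tokens.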
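As a rough illustration of the pre-split step, the sketch below approximates the cl100k_base pattern with Python's standard re module. It rests on stated assumptions: re has no \p{L} or \p{N} classes, so letters are approximated as [^\W\d_] and numbers as \d, and the helper name pre_tokenize is hypothetical rather than part of this tool or of tiktoken.

```python
import re

# Stdlib approximation of the cl100k_base pre-tokenizer pattern.
# ASSUMPTION: \p{L} is approximated by [^\W\d_] and \p{N} by \d,
# which differs from the real pattern for some rare Unicode characters.
PRE_TOKENIZE = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"   # English contractions
    r"|(?:_|[^\r\n\w])?[^\W\d_]+"     # word, with optional leading space/symbol
    r"|\d{1,3}"                       # numbers, at most 3 digits per chunk
    r"| ?(?:_|[^\s\w])+[\r\n]*"       # punctuation, with optional leading space
    r"|\s*[\r\n]+"                    # newlines and preceding whitespace
    r"|\s+(?!\S)"                     # trailing whitespace
    r"|\s+"                           # any remaining whitespace
)


def pre_tokenize(text: str) -> list[str]:
    """Split text into pre-token chunks (hypothetical helper name)."""
    return PRE_TOKENIZE.findall(text)


chunks = pre_tokenize("Hello world, it's 12345!")
print(chunks)
```

Concatenating the chunks reproduces the input exactly, which is a quick sanity check for any pre-tokenizer. If the third-party regex package is available, it accepts \p{L} and \p{N} directly, so the original pattern can be used without this approximation.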
Per-family counts are derived from calibrated multipliers against the cl100k baseline: Anthropic’s tokenizer is approximately 5% denser; Llama 3 uses a tiktoken-style byte-level BPE with a 128k vocabulary that handles common subword patterns slightly differently; Gemini uses a proprietary SentencePiece-based tokenizer. For exact counts, use the model’s native tokenizer.
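The multiplier approach can be sketched as below. Only the roughly 5% Anthropic figure comes from the description above, and it is read here as "about 5% more tokens for the same text", which is an assumption. The Llama and Gemini values are placeholders, and the names FAMILY_MULTIPLIERS and estimate_family_counts are hypothetical.

```python
# Multipliers relative to the cl100k baseline count.
# ASSUMPTION: "5% denser" is read as ~5% more tokens for the same text.
# ASSUMPTION: the llama and gemini values are placeholders, not calibrated.
FAMILY_MULTIPLIERS = {
    "openai": 1.00,
    "anthropic": 1.05,
    "llama": 1.00,
    "gemini": 1.00,
}


def estimate_family_counts(baseline_tokens: int) -> dict[str, int]:
    """Scale a cl100k baseline count into a rough per-family estimate."""
    return {
        family: round(baseline_tokens * multiplier)
        for family, multiplier in FAMILY_MULTIPLIERS.items()
    }


print(estimate_family_counts(1000))
```

A fixed multiplier cannot capture how the gap between tokenizers varies with content type, which is one reason the estimates carry the error margins quoted for this tool.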
Understanding tokenization changes how you write prompts. Here are concrete examples:
Token-efficient writing:
- "Summarize:" → 2 tokens
- "Please provide a comprehensive summary of the following text:" → 11 tokens

Token-wasteful patterns:
- https://example.com/v1/api/endpoint?param=value&other=thing is 15+ tokens
- "---" dividers, "###" headers, and repeated "**bold**" markers all cost tokens

Token-efficient rewriting:
- "1." is 2 tokens, because the pattern above splits the digit from the period, and "-" is 1; numbered lists cost slightly more but are structurally clearer
- <instruction> is 4 tokens, cleaner than freeform labeling

The context window as a budget: A 200K context window sounds enormous, but consider: a 50-turn conversation in which each turn holds a 500-token user message and a 500-token reply is 50,000 tokens - 25% of your budget, before any documents are added. Understanding token density helps you make informed decisions about what to include vs. summarize vs. drop.
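The budget arithmetic can be checked with a few lines. This is a minimal sketch that assumes two messages per turn (one user message and one reply), which is what makes the 50,000-token figure work. The function name budget_used is hypothetical.

```python
def budget_used(
    turns: int,
    messages_per_turn: int,
    tokens_per_message: int,
    context_window: int,
) -> tuple[int, float]:
    """Return (tokens consumed, fraction of the context window consumed)."""
    used = turns * messages_per_turn * tokens_per_message
    return used, used / context_window


# 50 turns, 2 messages per turn (assumed), 500 tokens each, 200K window.
used, fraction = budget_used(50, 2, 500, 200_000)
print(f"{used} tokens used, {fraction:.0%} of the window")
```

Running the same calculation with your own averages gives a quick sense of how much room is left for documents.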
Non-English text - non-Latin scripts tokenize less efficiently than English. Japanese and Chinese characters take 3 bytes each in UTF-8 and often split into multiple tokens per character, because BPE vocabularies are trained predominantly on English data. A 100-character Chinese sentence may produce 200+ tokens, while a 100-character English sentence produces 25–35. This has real cost and context implications for non-English applications.
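Byte-level BPE tokenizers such as tiktoken start from UTF-8 bytes, so the bytes-per-character ratio is a cheap first signal of how expensive a script is likely to be. The sketch below computes only that ratio, not token counts, and the helper name utf8_profile is hypothetical.

```python
def utf8_profile(text: str) -> tuple[int, int, float]:
    """Return (characters, UTF-8 bytes, bytes per character) for text."""
    chars = len(text)
    nbytes = len(text.encode("utf-8"))
    ratio = nbytes / chars if chars else 0.0
    return chars, nbytes, ratio


for sample in ("hello", "你好世界"):
    chars, nbytes, ratio = utf8_profile(sample)
    print(f"{sample!r}: {chars} chars, {nbytes} bytes, {ratio:.1f} bytes/char")
```

A high ratio does not guarantee a high token count, since frequent byte sequences can still merge into single tokens, but it flags text worth measuring with the model's native tokenizer.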
Code - code tokenizes differently from prose. Identifiers tokenize well (get_user, processEvent), but operators, brackets, and indentation tokenize as individual characters or small pairs. A Python function with significant indentation may tokenize 30–40% less efficiently than equivalent prose. Minified code is actually worse per character - it packs more semantic content into fewer characters, but those characters often tokenize at roughly one token each.
Mathematical notation - LaTeX math expressions are extremely token-inefficient. \frac{\partial f}{\partial x} is 12+ tokens for what’s semantically one derivative symbol. For math-heavy prompts, consider using Unicode math symbols (∂, ∑, ∫) instead of LaTeX - they often tokenize as single tokens.
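One way to apply this advice is a small substitution pass before sending a prompt. The sketch below covers only the three symbols named above. The names LATEX_TO_UNICODE and latex_to_unicode are hypothetical, and whether the result tokenizes more cheaply for a given model should be checked with that model's tokenizer.

```python
import re

# Only the three symbols mentioned in the text; extend as needed.
LATEX_TO_UNICODE = {"partial": "∂", "sum": "∑", "int": "∫"}

# Match \partial, \sum, \int, but not longer commands such as \intercal.
_COMMAND = re.compile(
    r"\\(" + "|".join(LATEX_TO_UNICODE) + r")(?![A-Za-z])"
)


def latex_to_unicode(expr: str) -> str:
    """Replace a few LaTeX math commands with their Unicode symbols."""
    return _COMMAND.sub(lambda m: LATEX_TO_UNICODE[m.group(1)], expr)


print(latex_to_unicode(r"\frac{\partial f}{\partial x}"))
```

Structural commands such as \frac are left untouched, so this pass reduces symbol overhead without trying to rewrite the expression's layout.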
Special characters and emoji - emoji occupy multiple bytes in UTF-8, so a byte-level tokenizer may split a single emoji across tokens. The thumbs-up emoji 👍 is 4 bytes in UTF-8 and typically tokenizes as 1–2 tokens. For prompts with high emoji density (social media analysis, chat data), this adds up.
Repeated patterns - BPE merges common patterns into single tokens. "the" is 1 token, "The" is 1 token, "THE" may be 1–2 tokens depending on the vocabulary. Common words in common casing are cheap; rare words or unusual casing are expensive.
This is a pre-tokenization visualizer, not the full BPE merge pipeline. Token boundaries shown are accurate to ~95% for English prose, ±10% for total count, ±15–20% for code or non-Latin scripts.
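The stated margins can be turned into a range around any count the tool reports. This sketch uses integer percentages to avoid floating-point rounding surprises and takes 20%, the upper end of the quoted 15–20% range, for code and non-Latin text. The names MARGIN_PCT and count_range are hypothetical.

```python
# Error margins in percent, taken from the accuracy note above.
# ASSUMPTION: 20 is used for code and non-Latin text (top of the 15-20% range).
MARGIN_PCT = {"english": 10, "code": 20, "non_latin": 20}


def count_range(estimate: int, kind: str = "english") -> tuple[int, int]:
    """Return a (low, high) range for an estimated token count."""
    pct = MARGIN_PCT[kind]
    low = estimate * (100 - pct) // 100            # round down
    high = -(-estimate * (100 + pct) // 100)       # round up
    return low, high


print(count_range(1000))
print(count_range(1000, "code"))
```

Budgeting against the high end of the range is the safer choice when a prompt is close to a context or cost limit.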
For exact counts, use:
- tiktoken library or the Tokenizer at platform.openai.com
- anthropic.count_tokens() API method
- transformers tokenizer for the specific model checkpoint

<|im_start|>, <|endoftext|>, and BOS/EOS tokens are not rendered separately.

For informational purposes only. Not financial, medical, or legal advice. You are solely responsible for how you use these tools.