gekro
GitHub LinkedIn
AI · Free Tool

Tokenizer Visualizer

See exactly how an LLM splits your text into tokens — color-coded, byte-counted, model-aware

Data verified: 2026-04-20
0 chars · 0 bytes

OpenAI

0

GPT-4 / 4o / o1 / o3

Anthropic

0

Claude 3 / 3.5 / 4

Meta Llama

0

Llama 3 / 3.1 / 3.2

Google

0

Gemini 1.5 / 2.0

0 visualized tokens

Token visualization

Tokens appear here as you type…
Word Number Symbol Whitespace Special
© 2026 Rohit Burani · MIT · Built at gekro.com · View source ↗

Guide

What It Does

Tokenizer Visualizer shows you how LLMs actually split your text into tokens — the unit models count, bill by, and process. Paste any text and see color-coded token chips across four model families: OpenAI, Anthropic, Meta Llama, and Google Gemini. Each chip shows token content, byte count, and character count. A summary panel shows total token counts per family side by side.

How to Use It

  1. Paste text into the input area — a system prompt, a user message, a block of code, anything.
  2. Token chips appear immediately, color-coded by type.
  3. Check the per-family token totals in the summary panel.
  4. Experiment: edit words, add whitespace, swap synonyms — see how the count changes in real time.

Color legend

  • Word (blue) — alphabetic sequences
  • Number (yellow) — numeric sequences
  • Symbol (purple) — punctuation, brackets, operators
  • Whitespace (brown) — spaces, tabs, newlines
  • Special (red) — non-printable or control characters

The four tokenizer families covered

FamilyModels
OpenAI cl100k / o200kGPT-4, GPT-4o, GPT-4-turbo, o1, o3
AnthropicClaude 3, 3.5, 3.7, 4 (Sonnet/Opus/Haiku)
Meta LlamaLlama 3, 3.1, 3.2, 3.3
Google GeminiGemini 1.5 Pro/Flash, 2.0 Flash

The Math / How It Works

This is a BPE pre-tokenization visualizer. Real tokenization is a two-step process: (1) pre-tokenization splits text into candidate token boundaries using a regex pattern, (2) BPE merges run through the vocabulary to combine adjacent fragments into learned merge pairs. This tool runs step 1 — the regex pre-split — which is accurate to ~95% for English prose.

The cl100k_base regex pattern used as the pre-tokenizer:

(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+

This pattern captures contractions, word tokens, numeric tokens (up to 3 digits at a time), punctuation with optional leading space, and whitespace. The pattern is the same one used internally by tiktoken’s cl100k_base tokenizer — so the split boundaries shown are real, not approximations.

Per-family counts are derived from calibrated multipliers against the cl100k baseline: Anthropic’s tokenizer is approximately 5% denser; Llama uses SentencePiece BPE with a 128k vocabulary that handles common subword patterns slightly differently; Gemini uses a proprietary BPE with sentence-piece preprocessing. For exact counts, use the model’s native tokenizer.

How Tokenization Affects Prompt Design

Understanding tokenization changes how you write prompts. Here are concrete examples:

Token-efficient writing:

  • "Summarize:" → 2 tokens
  • "Please provide a comprehensive summary of the following text:" → 11 tokens
  • Same instruction, 5× the cost. In a 1000-call-per-day system prompt, that’s 90,000 tokens of waste per day.

Token-wasteful patterns:

  • Repetitive whitespace — each extra blank line is 1–2 tokens
  • Repeated phrases across messages — there’s no cross-message deduplication in most APIs
  • Long URLs — https://example.com/v1/api/endpoint?param=value&other=thing is 15+ tokens
  • Embedded base64 — every base64 string tokenizes extremely poorly, often 1 character per token
  • Markdown formatting artifacts — --- dividers, ### headers, repeated **bold** markers all cost tokens

Token-efficient rewriting:

  • Use abbreviations in system prompts: “Resp in JSON” not “Please respond in valid JSON format”
  • Use numbered lists not bullets — 1. is 1 token, - is also 1 but numbered lists are structurally clearer
  • Use XML tags for structure (Anthropic recommends this explicitly): <instruction> is 4 tokens, cleaner than freeform labeling
  • Cut preambles — “You are a helpful assistant that…” is often 10+ tokens of context the model doesn’t need

The context window as a budget: A 200K context window sounds enormous, but consider: a 50-turn conversation with 500-token messages per turn is 50,000 tokens — 25% of your budget, before any documents are added. Understanding token density helps you make informed decisions about what to include vs. summarize vs. drop.

Tokenization Edge Cases

Non-English text — non-Latin scripts tokenize less efficiently than English. Japanese and Chinese characters often tokenize to multiple bytes per character with BPE vocabularies trained predominantly on English data. A 100-character Chinese sentence may produce 200+ tokens, while a 100-character English sentence produces 25–35. This has real cost and context implications for non-English applications.

Code — code tokenizes differently from prose. Identifiers tokenize well (get_user, processEvent), but operators, brackets, and indentation tokenize as individual characters or small pairs. A Python function with significant indentation may tokenize 30–40% less efficiently than equivalent prose. Minified code is actually worse — it packs more semantic content into fewer characters but those characters often tokenize at 1:1.

Mathematical notation — LaTeX math expressions are extremely token-inefficient. \frac{\partial f}{\partial x} is 12+ tokens for what’s semantically one derivative symbol. For math-heavy prompts, consider using Unicode math symbols (∂, ∑, ∫) instead of LaTeX — they often tokenize as single tokens.

Special characters and emoji — emoji tokenize as multiple bytes. The thumbs-up emoji 👍 is 4 bytes in UTF-8 and typically tokenizes as 1–2 tokens. For prompts with high emoji density (social media analysis, chat data), this adds up.

Repeated patterns — BPE merges common patterns into single tokens. the is 1 token, The is 1 token, THE may be 1–2 tokens depending on the vocabulary. Common words in common casing are cheap; rare words or unusual casing are expensive.

Tips & Power Use

  • Paste your full system prompt and compare the token count to a rewritten, tighter version. The difference often surprises people who’ve never measured it.
  • Use the cross-family comparison when deciding which model to use for a cost-sensitive task. If one model family tokenizes your use-case content 15% cheaper, that’s a real budget consideration at scale.
  • Check code blocks. Code is a common source of token surprise — 10 lines of well-indented Python can cost more than you expect. See where the token density is highest.
  • Pair with the RAG Chunk Inspector — token estimates in chunk inspector use the same pre-tokenizer as this tool. Consistent numbers across both tools.
  • Pair with the LLM Cost Calculator — once you know the token count, price the API call.

What This Is and Isn’t

This is a pre-tokenization visualizer, not the full BPE merge pipeline. Token boundaries shown are accurate to ~95% for English prose, ±10% for total count, ±15–20% for code or non-Latin scripts.

For exact counts, use:

  • OpenAI: tiktoken library or the Tokenizer at platform.openai.com
  • Anthropic: anthropic.count_tokens() API method
  • Llama: the Hugging Face transformers tokenizer for the specific model checkpoint

Limitations

  • Approximation only — exact tokenization requires the actual model vocabulary, which isn’t shipped to the browser
  • Anthropic and Llama are estimates — both use proprietary or large open vocabularies; counts use calibrated multipliers
  • No special tokens<|im_start|>, <|endoftext|>, BOS/EOS tokens are not rendered separately
  • No fine-tuned tokenizers — assumes the base model tokenizer for each family

For informational purposes only. Not financial, medical, or legal advice. You are solely responsible for how you use these tools.