
Tokenizer Visualizer

See exactly how an LLM splits your text into tokens - color-coded, byte-counted, model-aware


As-is, no warranty. These apps are free under their listed license and run entirely in your browser. Use at your own risk — don't blame me if your PC catches fire, your dog runs away, or the math turns out wrong. Verify anything that actually matters. None of this is professional financial, medical, legal, or engineering advice.

© 2026 Rohit Burani · MIT · Built at gekro.com

Guide

What It Does

Tokenizer Visualizer shows you how LLMs actually split your text into tokens - the unit models count, bill by, and process. Paste any text and see color-coded token chips across four model families: OpenAI, Anthropic, Meta Llama, and Google Gemini. Each chip shows token content, byte count, and character count. A summary panel shows total token counts per family side by side.

How to Use It

  1. Paste text into the input area - a system prompt, a user message, a block of code, anything.
  2. Token chips appear immediately, color-coded by type.
  3. Check the per-family token totals in the summary panel.
  4. Experiment: edit words, add whitespace, swap synonyms - see how the count changes in real time.

Color legend

  • Word (blue) - alphabetic sequences
  • Number (yellow) - numeric sequences
  • Symbol (purple) - punctuation, brackets, operators
  • Whitespace (brown) - spaces, tabs, newlines
  • Special (red) - non-printable or control characters
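One way the chip categories in the legend could be assigned is sketched below in Python. The function name `classify_chunk` and the order of the rules are assumptions made for illustration; this guide does not show the tool's actual classification logic.

```python
import unicodedata


def classify_chunk(chunk: str) -> str:
    """Map a pre-token chunk to one of the five legend categories (hypothetical rules)."""
    # Spaces, tabs and newlines are checked first, because tab and newline
    # are also Unicode control characters.
    if chunk.isspace():
        return "whitespace"
    # Any remaining control character (category Cc) marks the chunk as special.
    if any(unicodedata.category(ch) == "Cc" for ch in chunk):
        return "special"
    # Chunks such as " world" or "'s" carry a leading space or apostrophe,
    # so look for any letter instead of requiring all letters.
    if any(ch.isalpha() for ch in chunk):
        return "word"
    if any(ch.isdigit() for ch in chunk):
        return "number"
    return "symbol"


for sample in [" world", "123", "!", "\n\n", "\x00"]:
    print(repr(sample), "->", classify_chunk(sample))
```

Checking whitespace before control characters matters here: reversing the first two rules would color every newline red.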

The four tokenizer families covered

Family                  Models
OpenAI cl100k / o200k   GPT-4, GPT-4o, GPT-4-turbo, o1, o3
Anthropic               Claude 3, 3.5, 3.7, 4 (Sonnet/Opus/Haiku)
Meta Llama              Llama 3, 3.1, 3.2, 3.3
Google Gemini           Gemini 1.5 Pro/Flash, 2.0 Flash

The Math / How It Works

This is a BPE pre-tokenization visualizer. Real tokenization is a two-step process: (1) pre-tokenization splits the text into chunks using a regex pattern; (2) inside each chunk, BPE repeatedly merges adjacent fragments according to the learned merge table. Merges never cross a chunk boundary. This tool runs only step 1, the regex pre-split, which is accurate to ~95% for English prose, where common words usually merge back into a single token.

The cl100k_base regex pattern used as the pre-tokenizer:

(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+

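This pattern captures contractions, words (with an optional leading space or symbol), numbers (at most 3 digits per chunk), punctuation with an optional leading space, and whitespace. It is the same pattern used internally by tiktoken’s cl100k_base tokenizer, so every boundary shown is a real token boundary. What the pre-split cannot show is where BPE divides a chunk further: a rare or long word appears as one chip but may be several tokens.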
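The pre-split step can be sketched in Python with the standard library. Python's built-in `re` module does not support `\p{L}` or `\p{N}`, so this sketch substitutes `[^\W\d_]` for letters and `\d` for numbers; that substitution is an approximation introduced here, not part of the original pattern. The third-party `regex` package accepts the original pattern unchanged.

```python
import re

# Stdlib approximation of the cl100k_base pre-tokenizer pattern.
# Assumption: [^\W\d_] stands in for \p{L} and \d stands in for \p{N}.
_ALTERNATIVES = [
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)",  # contractions
    r"(?:[^\r\n\w]|_)?[^\W\d_]+",     # word, optional leading space or symbol
    r"\d{1,3}",                       # digits, at most 3 per chunk
    r" ?(?:[^\s\w]|_)+[\r\n]*",       # punctuation, optional leading space
    r"\s*[\r\n]+",                    # newlines with preceding whitespace
    r"\s+(?!\S)",                     # whitespace not followed by text
    r"\s+",                           # any remaining whitespace
]
PRETOKEN_RE = re.compile("|".join(_ALTERNATIVES))


def pre_tokenize(text: str) -> list[str]:
    """Split text into pre-token chunks (step 1 only, no BPE merges)."""
    return PRETOKEN_RE.findall(text)


chunks = pre_tokenize("Hello, world! It's 12345.")
print(chunks)
# ['Hello', ',', ' world', '!', ' It', "'s", ' ', '123', '45', '.']
```

Note how the space attaches to the following word (`' world'`) but not to the following number, and how `12345` is cut into `123` and `45`.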

Per-family counts are derived from calibrated multipliers against the cl100k baseline: Anthropic’s tokenizer is approximately 5% denser; Llama 3 uses a tiktoken-style BPE (not SentencePiece, which Llama 2 used) with a 128K vocabulary that handles common subword patterns slightly differently; Gemini uses a proprietary SentencePiece-based tokenizer. For exact counts, use the model’s native tokenizer.

How Tokenization Affects Prompt Design

Understanding tokenization changes how you write prompts. Here are concrete examples:

Token-efficient writing:

  • "Summarize:" → 2 tokens
  • "Please provide a comprehensive summary of the following text:" → 11 tokens
  • Same instruction, about 5× the cost. In a system prompt sent 1,000 times per day, the 9 extra tokens add up to 9,000 wasted tokens per day.
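The arithmetic behind this comparison, as a small Python sketch. The token counts are the ones quoted in the bullets above; the function name `daily_token_waste` is illustrative.

```python
def daily_token_waste(verbose_tokens: int, concise_tokens: int, calls_per_day: int) -> int:
    """Extra tokens spent per day by using the longer phrasing on every call."""
    return (verbose_tokens - concise_tokens) * calls_per_day


verbose, concise, calls = 11, 2, 1000
print(f"cost ratio: {verbose / concise:.1f}x")
print(f"wasted tokens per day: {daily_token_waste(verbose, concise, calls):,}")
```

Multiply the result by your per-token input price to turn it into a daily cost.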

Token-wasteful patterns:

  • Repetitive whitespace - each extra blank line is 1–2 tokens
  • Repeated phrases across messages - there’s no cross-message deduplication in most APIs
  • Long URLs - https://example.com/v1/api/endpoint?param=value&other=thing is 15+ tokens
  • Embedded base64 - every base64 string tokenizes extremely poorly, often 1 character per token
  • Markdown formatting artifacts - --- dividers, ### headers, repeated **bold** markers all cost tokens

Token-efficient rewriting:

  • Use abbreviations in system prompts: “Resp in JSON” not “Please respond in valid JSON format”
  • Numbered lists vs. bullets - with the cl100k pattern, 1. is 2 tokens (the digit and the period split separately) and - is 1, so choose numbered lists for structural clarity, not for token savings
  • Use XML tags for structure (Anthropic recommends this explicitly): <instruction> is 4 tokens, cleaner than freeform labeling
  • Cut preambles - “You are a helpful assistant that…” is often 10+ tokens of context the model doesn’t need

The context window as a budget: A 200K context window sounds enormous, but consider: a 50-turn conversation in which each turn is a 500-token user message plus a 500-token reply comes to 50,000 tokens - 25% of your budget, before any documents are added. Understanding token density helps you make informed decisions about what to include vs. summarize vs. drop.
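A quick way to run this budget check for your own conversations is sketched below. It assumes each turn consists of two messages (one user, one assistant) of 500 tokens each, which is how the 50,000-token figure is reached; the function and parameter names are illustrative.

```python
def context_usage(turns: int, messages_per_turn: int,
                  tokens_per_message: int, context_window: int) -> tuple[int, float]:
    """Return (tokens used by the conversation, fraction of the window used)."""
    used = turns * messages_per_turn * tokens_per_message
    return used, used / context_window


used, share = context_usage(turns=50, messages_per_turn=2,
                            tokens_per_message=500, context_window=200_000)
print(f"{used:,} tokens = {share:.0%} of the window")
# 50,000 tokens = 25% of the window
```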

Tokenization Edge Cases

Non-English text - non-Latin scripts tokenize less efficiently than English. Japanese and Chinese characters typically take 3 bytes each in UTF-8, and BPE vocabularies trained predominantly on English data often split a single character into multiple tokens. A 100-character Chinese sentence may produce 200+ tokens, while a 100-character English sentence produces 25–35. This has real cost and context implications for non-English applications.
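Byte-level BPE starts from UTF-8 bytes, so a rough first indicator of tokenization cost is how many bytes each character needs. The sketch below measures only that ratio; it does not count tokens, and the sample strings and function name are illustrative.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character (0.0 for empty text)."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


english = "Tokenizers split text"
chinese = "分词器切分文本"

print(f"English: {utf8_bytes_per_char(english):.1f} bytes/char")  # 1.0
print(f"Chinese: {utf8_bytes_per_char(chinese):.1f} bytes/char")  # 3.0
```

A script at 3 bytes per character needs learned merges just to reach one token per character, which is where the gap with English comes from.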

Code - code tokenizes differently from prose. Identifiers tokenize well (get_user, processEvent), but operators, brackets, and indentation tokenize as individual characters or small pairs. A Python function with significant indentation may tokenize 30–40% less efficiently than equivalent prose. Minified code is actually worse per character: it packs more semantic content into fewer characters, but those characters often tokenize at close to one token each.

Mathematical notation - LaTeX math expressions are extremely token-inefficient. \frac{\partial f}{\partial x} is 12+ tokens for what’s semantically one derivative symbol. For math-heavy prompts, consider using Unicode math symbols (∂, ∑, ∫) instead of LaTeX - they often tokenize as single tokens.

Special characters and emoji - emoji are multi-byte in UTF-8 and can span more than one token. The thumbs-up emoji 👍 is 4 bytes in UTF-8 and typically tokenizes as 1–2 tokens. For prompts with high emoji density (social media analysis, chat data), this adds up.

Repeated patterns - BPE merges common patterns into single tokens. the is 1 token, The is 1 token, THE may be 1–2 tokens depending on the vocabulary. Common words in common casing are cheap; rare words or unusual casing are expensive.
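The effect of learned merges can be illustrated with a toy BPE merger. The two-entry merge table below is invented for this example and is far smaller than any real vocabulary; real tokenizers such as tiktoken operate on bytes with tens of thousands of ranked merges.

```python
def bpe_merge(chunk: str, merges: list[tuple[str, str]]) -> list[str]:
    """Repeatedly merge the adjacent pair with the best (lowest) rank."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    parts = list(chunk)  # start from single characters
    while len(parts) > 1:
        # Rank every adjacent pair; pairs not in the table rank as infinity.
        candidates = [
            (ranks.get((left, right), float("inf")), index)
            for index, (left, right) in enumerate(zip(parts, parts[1:]))
        ]
        best_rank, index = min(candidates)
        if best_rank == float("inf"):
            break  # no learned merge applies any more
        parts[index:index + 2] = [parts[index] + parts[index + 1]]
    return parts


# Hypothetical merge table: "t"+"h" was learned first, then "th"+"e".
TOY_MERGES = [("t", "h"), ("th", "e")]

print(bpe_merge("the", TOY_MERGES))  # ['the']
print(bpe_merge("THE", TOY_MERGES))  # ['T', 'H', 'E']
```

With this table the lowercase word collapses to one token, while the uppercase form has no applicable merges and stays at three.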

Tips & Power Use

  • Paste your full system prompt and compare the token count to a rewritten, tighter version. The difference often surprises people who’ve never measured it.
  • Use the cross-family comparison when deciding which model to use for a cost-sensitive task. If one model family tokenizes your use-case content 15% cheaper, that’s a real budget consideration at scale.
  • Check code blocks. Code is a common source of token surprise - 10 lines of well-indented Python can cost more than you expect. See where the token density is highest.
  • Pair with the RAG Chunk Inspector - its token estimates use the same pre-tokenizer as this tool, so the numbers are consistent across both.
  • Pair with the LLM Cost Calculator - once you know the token count, price the API call.

What This Is and Isn’t

This is a pre-tokenization visualizer, not the full BPE merge pipeline. For English prose, about 95% of the token boundaries shown are correct and total counts are within ±10%. For code or non-Latin scripts, expect total counts to be off by ±15–20%.

For exact counts, use:

  • OpenAI: tiktoken library or the Tokenizer at platform.openai.com
  • Anthropic: the token counting endpoint (client.messages.count_tokens() in the Python SDK)
  • Llama: the Hugging Face transformers tokenizer for the specific model checkpoint
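For the OpenAI case, an exact count with tiktoken might look like the sketch below. tiktoken is a third-party package (`pip install tiktoken`) and downloads the vocabulary on first use, so this needs network access once; the wrapper function name is illustrative.

```python
def count_openai_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Return the exact token count for text under the given tiktoken encoding."""
    # Imported lazily so the rest of a script still works without tiktoken.
    import tiktoken

    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))


if __name__ == "__main__":
    prompt = "Summarize:"
    print(count_openai_tokens(prompt))
```

By default `encode` raises an error if the text contains special-token strings such as `<|endoftext|>`, so strip or explicitly allow them when counting raw chat transcripts.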

Limitations

  • Approximation only - exact tokenization requires the actual model vocabulary, which isn’t shipped to the browser
  • Anthropic, Llama, and Gemini counts are estimates - they come from calibrated multipliers against the cl100k baseline, not from each family’s own vocabulary
  • No special tokens - <|im_start|>, <|endoftext|>, BOS/EOS tokens are not rendered separately
  • No fine-tuned tokenizers - assumes the base model tokenizer for each family
