Compare two prompt versions side-by-side — spot every change at a glance
Paste Version A (before) and Version B (after) of any prompt — system prompt, user message, few-shot example, whatever — and the tool produces a color-coded diff. Green lines or words were added. Red lines or words were removed. Grey lines are unchanged. The stats bar shows character count delta, line count delta, estimated token delta, and a similarity percentage.
Two diff modes: Line diff for high-level structure changes (added paragraphs, moved sections), Word diff for fine-grained edits within lines (changed phrasing, substituted words).
Everything runs in the browser. No data leaves your machine.
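Export produces a .txt file with both prompt versions and the diff, useful for sharing with teammates or storing in version history.

The diff algorithm computes a Longest Common Subsequence (LCS) with classic dynamic programming. It produces the same kind of edit script as Myers diff, but it is the simpler table-based formulation rather than Myers's greedy O(ND) algorithm. The idea: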
1. Build an (m+1) × (n+1) DP table, where m is the length of sequence A and n is the length of sequence B.
2. Fill it in: dp[i][j] = dp[i-1][j-1] + 1 if A[i] == B[j] (with sequences indexed from 1), otherwise max(dp[i-1][j], dp[i][j-1]).
3. Backtrack from dp[m][n] to reconstruct the edit script: where elements matched (diagonal move), emit eq; where we consumed from B only (left move), emit add; where we consumed from A only (up move), emit rem.

This is O(m × n) time and space, which is fine for prompt-sized text (hundreds of lines, thousands of words). It should stay responsive at any realistic prompt size.
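The table-and-backtrack procedure can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual source: the function name `lcs_diff` and the tuple-based edit script are assumptions made for the example, and the tie-break between a left move and an up move is one arbitrary but valid choice.

```python
def lcs_diff(a, b):
    """Return an edit script of (op, element) pairs, with op in {"eq", "add", "rem"}.

    Hypothetical helper for illustration; `a` and `b` are sequences of
    lines or word tokens.
    """
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of a[:i] and b[:j]; row 0 and column 0 stay 0.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:  # Python lists are 0-indexed
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Backtrack from dp[m][n], collecting operations in reverse order.
    script = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            script.append(("eq", a[i - 1]))   # diagonal move
            i -= 1
            j -= 1
        elif j > 0 and (i == 0 or dp[i][j - 1] >= dp[i - 1][j]):
            script.append(("add", b[j - 1]))  # left move: consumed from B only
            j -= 1
        else:
            script.append(("rem", a[i - 1]))  # up move: consumed from A only
            i -= 1
    script.reverse()
    return script


before = ["You are helpful.", "Be concise.", "Answer in English."]
after = ["You are helpful.", "Be brief.", "Answer in English."]
for op, line in lcs_diff(before, after):
    print(op, line)
```

With this tie-break, a changed line comes out as a rem followed by an add, which matches the red-then-green ordering of a line diff.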
For line diff, the sequence elements are individual lines (split on \n). Each line is treated as an atomic unit. If a single word changes inside a line, the entire line is shown as removed and re-added. This gives a clean, git-style view of structural changes.
For word diff, the text is split into tokens at whitespace boundaries. Words and whitespace runs are both kept as tokens, so the original text can be reconstructed exactly. The LCS diff runs on this token array. Changed words are highlighted inline within their line context; unchanged text is shown in muted grey. This mode is better for spotting phrasing changes: a reworded sentence shows exactly which words were swapped.
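The two tokenization modes can be sketched as follows. The function names and the regular expression are assumptions for illustration; the tool's own tokenizer may differ in detail. The key property is that joining the tokens gives back the input unchanged.

```python
import re


def split_lines(text):
    """Line mode: each line is one atomic element."""
    return text.split("\n")


def split_words(text):
    """Word mode: words and whitespace runs are both tokens.

    \\S+ matches a run of non-whitespace (a word), \\s+ a run of whitespace,
    so the tokens alternate and nothing from the input is dropped.
    """
    return re.findall(r"\S+|\s+", text)


prompt = "Be concise.\n  Answer in  English."
print(split_lines(prompt))  # ['Be concise.', '  Answer in  English.']
print(split_words("a  b"))  # ['a', '  ', 'b']

# Both splits are lossless: joining the pieces restores the original text.
assert "\n".join(split_lines(prompt)) == prompt
assert "".join(split_words(prompt)) == prompt
```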
The similarity score uses Sørensen-Dice on the LCS:
similarity = (2 × matched_elements) / (|A| + |B|)
Where matched_elements is the count of elements in the LCS (the eq entries in the edit script). A score of 100% means the prompts are identical. 0% means they share no common elements at all. In practice, iterative prompt refinements tend to score 70–95%: most elements are unchanged, but the additions and deletions are still meaningful.
The score updates in real time as you type. It’s computed on the same sequence used for the diff (lines in line mode, words in word mode), so switching modes will change the score.
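As a rough sketch of the score, the snippet below computes the LCS length with a two-row DP and applies the formula above. The names `lcs_length` and `similarity` are hypothetical, and treating two empty inputs as 100% similar is an assumption, since the formula is undefined there. For brevity the word-mode example uses `str.split()`, which drops whitespace tokens.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, keeping only two DP rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Sørensen-Dice on the LCS: 2 * matched / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # assumption: two empty inputs count as identical
    return 2 * lcs_length(a, b) / (len(a) + len(b))


v1 = "You are a helpful assistant"
v2 = "You are a concise assistant"

# Word mode: 4 of 5 words match on each side, so 2 * 4 / (5 + 5) = 0.8.
print(similarity(v1.split(), v2.split()))  # 0.8
# Line mode: each prompt is one line and the lines differ, so the score is 0.0.
print(similarity([v1], [v2]))  # 0.0
```

The same pair of prompts scores 80% in word mode and 0% in line mode, which is why the score changes when you switch modes.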
Prompt engineering is software engineering in disguise, with one important difference: version control tooling built for code treats prompts as opaque text blobs. git diff on a prompt file works, but it doesn’t render in context, doesn’t show token counts, and you can’t run it on two prompts you’re comparing in a notebook or chat interface.
In practice, prompt iteration looks like this: you have v1 that performs at 80% on your eval set. You make what you think is a targeted tweak and get v2. Did v2 actually do what you intended? Without a diff, you’re comparing two blocks of text by memory, which degrades fast past 200 words.
The most common failure mode is a prompt that works in testing because you subconsciously remember the intended behavior, but the actual text change was broader than you realized — you rephrased a constraint slightly and the model stopped following it in edge cases. A diff makes that visible.
Prompt regression testing is another use case. If you’re running evaluations and a previously-passing test starts failing after a prompt update, the diff tells you exactly which lines changed so you can isolate the cause. Without it, you’re debugging a black box by re-reading a 500-word document and trying to spot the difference.
Collaboration is a third case. Sending v7_final_FINAL.txt to a teammate is bad enough. Sending them the diff alongside it tells them precisely what changed, so they can focus on whether those changes matter. It's the commit message for your prompt.
Green lines / words — present in Version B but not Version A. These are additions.
Red lines / words — present in Version A but not Version B. These are deletions. In line mode, a changed line appears as one red line (removed) followed by one green line (added). In word mode, changed words appear inline as red struck-through text next to the green replacement.
Grey lines — unchanged. Present in both versions. In line mode, all unchanged lines are shown. There’s no context collapse (no ... ellipsis hiding lines) — prompt diffs are short enough that full context is always useful.
Stats bar: character count delta, line count delta, estimated token delta, and similarity percentage.
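The deltas in the stats bar could be computed along these lines. This is a sketch under stated assumptions: the page does not say how tokens are estimated, so the 4-characters-per-token ratio below is a common rule of thumb for English text, not the tool's documented method, and both function names are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate.

    Assumption: about 4 characters per token. Real tokenizers vary by
    model and by language.
    """
    return -(-len(text) // chars_per_token)  # ceiling division


def stats_delta(a, b):
    """Deltas from Version A to Version B; positive means B is larger."""
    return {
        "chars": len(b) - len(a),
        "lines": len(b.split("\n")) - len(a.split("\n")),
        "tokens": estimate_tokens(b) - estimate_tokens(a),
    }


before = "Be concise."
after = "Be concise.\nAnswer in English."
print(stats_delta(before, after))  # {'chars': 19, 'lines': 1, 'tokens': 5}
```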
Use line diff first, then word diff. Line diff shows you which paragraphs changed. Word diff shows you the exact phrasing changes within those paragraphs. The two modes are complementary.
Paste system prompt and user message together if you’re comparing full prompt compositions. The diff doesn’t care — it just compares two text blocks.
Track your prompt history manually. Export the diff after each significant version and name the file with a timestamp or version number. You now have a readable change log for your prompt. This is lightweight prompt version control that works without any tooling overhead.
Use the token delta to budget changes. If you added 150 tokens to your system prompt, that’s 150 fewer tokens available for context in every call. The stat makes that cost explicit.
Similarity as a sanity check. If you intended to make a small tweak but the similarity dropped from 95% to 70%, something broader changed. Review the diff before deploying.
Comparing few-shot examples. If you’re iterating on the examples in your prompt (the user/assistant turns you include as demonstrations), word diff is especially useful — it shows the exact word substitutions in your target outputs.
The token delta is an estimate. For exact counts, use tiktoken with the specific model’s encoder.