Text Diff Checker · 7 min read
How Diff Algorithms Work — From Unix to Git
The diff algorithm compares two texts and finds the minimum set of changes that transforms one into the other. Created in 1974 at Bell Labs, it now underlies Git, GitHub, and every code review tool in existence.
The Problem: Finding the Shortest Edit Script
The core problem a diff algorithm solves is: given two sequences of items (lines, characters, words), what is the minimum set of insertions and deletions needed to transform the first sequence into the second?
This is called the Shortest Edit Script (SES) problem, and it is equivalent to finding the Longest Common Subsequence (LCS) of the two sequences. The items that appear in both sequences (in the same relative order) form the LCS — everything else is either a deletion from the original or an insertion in the new version.
Example: comparing "ABCDE" to "ACEFG":
- LCS: A, C, E (appear in both in the same order)
- Deletions: B, D (in original but not new)
- Insertions: F, G (in new but not original)
The diff output shows: A (unchanged), B (deleted), C (unchanged), D (deleted), E (unchanged), F (inserted), G (inserted). This is exactly the format you see in a terminal diff output or a code review tool.
The Original Unix diff (1974)
The diff command was created at Bell Laboratories by Douglas McIlroy and James Hunt, with contributions from later colleagues. The first version was released as part of Unix in 1974. Hunt and McIlroy published the underlying algorithm in a Bell Labs Technical Report in 1975.
The original algorithm used a dynamic programming approach to find the LCS, running in O(mn) time and space — where m and n are the lengths of the two files being compared. For the file sizes of 1974 (small programs, configuration files), this was entirely practical.
The output format defined by the original Unix diff — with lines prefixed by < for deletions, > for insertions, and context around each change — became the standard that all subsequent diff tools adopted. The "unified diff" format (lines prefixed by - and + with three lines of context) was introduced later and is the format used by git diff today.
The Myers Algorithm (1986)
The most widely used modern diff algorithm was published by Eugene Myers in 1986 in the journal Algorithmica. Myers's O(ND) algorithm is faster than the original O(mn) approach by exploiting a key insight: rather than computing the entire LCS, it searches for the shortest path through an "edit graph" — a grid where moving right represents keeping a character from the original, moving down represents inserting a character from the new version, and moving diagonally represents a match.
The algorithm's time complexity is O(ND), where N is the total length of both sequences and D is the size of the minimum edit script (number of insertions plus deletions). For files that are similar (small D), the algorithm is extremely fast. For completely different files (large D), it degrades toward O(N²) but in practice most version-controlled files are similar between revisions.
The Myers algorithm is used in:
- GNU diff (the standard diff on Linux systems)
- Git (as its default diff algorithm)
- GitHub and GitLab code review interfaces
- The diff libraries used by most modern programming languages
Git's Diff Variants
Git supports multiple diff algorithms, selectable with the --diff-algorithm flag:
Myers (default)
The standard Myers algorithm. Fast and produces compact diffs. Occasionally produces "ugly" diffs where the output is correct but not where a human would have drawn the change boundaries — for example, when a block of code is moved, Myers may show it as a deletion at the old position and insertion at the new position rather than recognising it as a move.
Patience
The Patience diff algorithm was designed by Bram Cohen (creator of BitTorrent) to produce more human-readable diffs. It first finds unique lines that appear in both files exactly once, uses those as anchor points, and then applies Myers recursively within each section. The result tends to group related changes together in a way that matches human intuition about where changes occurred.
Histogram
An extension of Patience diff that uses a frequency histogram to handle cases where lines appear more than once. This is Git's recommended algorithm for complex diffs and is used as the default in some Git GUI tools. git diff --diff-algorithm=histogram
Minimal
Spends extra computation to find the absolutely smallest possible edit script. Slower than Myers but useful when diff size matters more than speed — for example, when computing binary diffs or patch files.
Character-Level vs. Line-Level vs. Word-Level Diff
Most version control tools diff at the line level — the unit of comparison is an entire line of text. This is efficient and meaningful for code, where lines correspond to statements or declarations. But line-level diffs can be coarse: if a single word changes in a long line, the entire line is marked as changed.
Document editors (Google Docs, Word's Track Changes) diff at the word or character level to show precisely which words were added or removed within a paragraph. This is more expensive but produces much more precise output for natural language text.
Web-based code review tools like GitHub use a combination: line-level diff for the primary view, with character-level highlighting within changed lines to show exactly which characters changed. This gives the best of both approaches — structural overview plus character-level precision.
Three-Way Diff and Merge
When two people edit the same file in parallel, a standard two-file diff cannot resolve the conflict automatically. Three-way diff compares three files: the common ancestor (the original before either person edited it) and the two modified versions. Changes that appear in one branch but not the other can be applied automatically. Only changes to the same section in both branches require human intervention — a merge conflict.
Git's merge command, most IDEs' merge conflict resolution tools, and collaborative editing systems all use three-way diff as their underlying algorithm. The ancestor version is the key: without it, the tool cannot distinguish between "A changed this" and "B changed this" — it can only see that the two versions differ.
References
- Hunt, J.W., & McIlroy, M.D. (1975). An algorithm for differential file comparison. Bell Laboratories Computing Science Technical Report 41.
- Myers, E.W. (1986). An O(ND) difference algorithm and its variations. Algorithmica, 1(1–4), 251–266.
- Miller, W., & Myers, E.W. (1985). A file comparison program. Software: Practice and Experience, 15(11), 1025–1040.
- Torvalds, L. (2005). Git: Initial revision. github.com/git/git.
- Ukkonen, E. (1985). Algorithms for approximate string matching. Information and Control, 64(1–3), 100–118.