Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Where can I find the diff algorithm?

Tags:

algorithm

diff

Where can I find an explanation and implementation of the diff algorithm?

First of all I have to recognize that I'm not sure if this is the correct name of the algorithm. For example, how does Stack Overflow mark the differences between two edits of the same question?

PS: I know the C and PHP programming languages.

like image 712
dole doug Avatar asked May 12 '10 08:05

dole doug


People also ask

What algorithm does diff use?

Myers Algorithm – human readable diffs This is used by tools such as Git Diff and GNU Diff. Original Myers time and space complexity is O(ND) where N is the sum of the lengths of both inputs and D is the size of the minimum edit script that converts one input to the other.

What diff algorithm does Github use?

Git supports 4 diff algorithms Myers, Minimal, Patience, and Histogram. And Myers is used as the default algorithm.

How is diff implemented?

The diff command is invoked from the command line, passing it the names of two files: diff original new . The output of the command represents the changes required to transform the original file into the new file. If original and new are directories, then diff will be run on each file that exists in both directories.

How does diff checker work?

How does Difference Checker work? This text difference checker will compare the given files and highlights the unique text written in both files. It compares the text line-by-line and checks the possible difference between the given documents.


2 Answers

There is really no such thing as "the diff algorithm". There are many different diff algorithms, and in fact the particular diff algorithms used are in some cases considered a business advantage of the particular diff tool.

In general, many diff algorithms are based on the Longest Common Subsequence (LCS) problem.

The original Unix diff program from the 1970s was written by Doug McIllroy and uses what is known as the Hunt-McIllroy algorithm. Almost 40 years later, extensions and derivatives of that algorithm are still very common.

A couple of years ago, Bram Cohen (creator of the most successful filesharing program and the least successful version control system) created the Patience Diff algorithm that is designed to give more human-readable results than LCS. It was originally implemented in the Bazaar VCS and also added to Git as an option.

However, unless you are interested in research on diff algorithms, your best bet would probably be to just use some existing diff library like Davide Libenzi's LibXDiff, which is for example what Git uses. I wouldn't be too surprised if there was already a PHP extension wrapping it. An nice alternative is Google's Diff-Match-Patch library, which is used in Bespin or WhiteRoom, for example and which is available for many languages. It uses the Meyers Diff Algorithm plus some pre- and post-processing for additional speedups.

A completely different approach, if you are more interested in merging than diffing, is called Operational Transformations. The idea of OT is that instead of figuring out the differences between two documents, you try to "reverse engineer" the operations that led to those differences. This allows for much better merging, because you can then "replay" those operations. These are most useful for real-time collaborative editors such as EtherPad, Google Wave or SubEthaEdit.

like image 111
Jörg W Mittag Avatar answered Oct 21 '22 23:10

Jörg W Mittag


What is wrong with wikipedia where it states it is Hunt-McIlroy algorithm?

There is OCR'd paper describing the algorithm (explanation) and you can inspect source (implementation).

The so related questions also list (among others):
'Best' Diff Algorithm
How do document diff algorithms work?
Diff Algorithm
which all seem to be useful.

like image 23
Unreason Avatar answered Oct 21 '22 22:10

Unreason