Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a diff-like algorithm that handles moving block of lines?

Tags:

algorithm

diff

The diff program, in its various incarnations, is reasonably good at computing the difference between two text files and expressing it more compactly than showing both files in their entirety. It shows the difference as a sequence of inserted and deleted chunks of lines (or changed lines in some cases, but that's equivalent to a deletion followed by an insertion). The same or very similar program or algorithm is used by patch and by source control systems to minimize the storage required to represent the differences between two versions of the same file. The algorithm is discussed here and here.

But it falls down when blocks of text are moved within the file.

Suppose you have the following two files, a.txt and b.txt (imagine that they're both hundreds of lines long rather than just 6):

a.txt   b.txt -----   ----- 1       4 2       5 3       6 4       1 5       2 6       3 

diff a.txt b.txt shows this:

$ diff a.txt b.txt  1,3d0 < 1 < 2 < 3 6a4,6 > 1 > 2 > 3 

The change from a.txt to b.txt can be expressed as "Take the first three lines and move them to the end", but diff shows the complete contents of the moved chunk of lines twice, missing an opportunity to describe this large change very briefly.

Note that diff -e shows the block of text only once, but that's because it doesn't show the contents of deleted lines.

Is there a variant of the diff algorithm that (a) retains diff's ability to represent insertions and deletions, and (b) efficiently represents moved blocks of text without having to show their entire contents?

like image 996
Keith Thompson Avatar asked Apr 08 '12 20:04

Keith Thompson


People also ask

What is the diff algorithm?

A diff algorithm outputs the set of differences between two inputs. These algorithms are the basis of a number of commonly used developer tools. Yet understanding the inner workings of diff algorithms is rarely necessary to use said tools.

How does diff tool work?

The diff command is invoked from the command line, passing it the names of two files: diff original new . The output of the command represents the changes required to transform the original file into the new file. If original and new are directories, then diff will be run on each file that exists in both directories.

What are diffs in programming?

Alternatively referred to as compare, diff is short for different or difference and describes a program's ability to show the difference between two or more files. A diff is an invaluable tool in programming as it enables a developer to see what has changed in-between versions.


1 Answers

Since you asked for an algorithm and not an application, take a look at "The String-to-String Correction Problem with Block Moves" by Walter Tichy. There are others, but that's the original, so you can look for papers that cite it to find more.

The paper cites Paul Heckel's paper "A technique for isolating differences between files" (mentioned in this answer to this question) and mentions this about its algorithm:

Heckel[3] pointed out similar problems with LCS techniques and proposed a linear-lime algorithm to detect block moves. The algorithm performs adequately if there are few duplicate symbols in the strings. However, the algorithm gives poor results otherwise. For example, given the two strings aabb and bbaa, Heckel's algorithm fails to discover any common substring.

like image 128
Zoë Peterson Avatar answered Sep 24 '22 15:09

Zoë Peterson