Is there a diff-like algorithm that handles moved blocks of lines?

Tags: algorithm, diff

The diff program, in its various incarnations, is reasonably good at computing the difference between two text files and expressing it more compactly than showing both files in their entirety. It shows the difference as a sequence of inserted and deleted chunks of lines (or changed lines in some cases, but that's equivalent to a deletion followed by an insertion). The same or very similar program or algorithm is used by patch and by source control systems to minimize the storage required to represent the differences between two versions of the same file. The algorithm is discussed here and here.

But it falls down when blocks of text are moved within the file.

Suppose you have the following two files, a.txt and b.txt (imagine that they're both hundreds of lines long rather than just 6):

    a.txt   b.txt
    -----   -----
    1       4
    2       5
    3       6
    4       1
    5       2
    6       3

diff a.txt b.txt shows this:

    $ diff a.txt b.txt
    1,3d0
    < 1
    < 2
    < 3
    6a4,6
    > 1
    > 2
    > 3

The change from a.txt to b.txt can be expressed as "Take the first three lines and move them to the end", but diff shows the complete contents of the moved chunk of lines twice, missing an opportunity to describe this large change very briefly.

Note that diff -e shows the block of text only once, but that's because it doesn't show the contents of deleted lines.
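To make the desired output concrete, here is a minimal sketch of the kind of result being asked about. It is a naive post-processing step, not a known variant of the diff algorithm: it runs Python's difflib.SequenceMatcher and then pairs any deleted block with an identical inserted block, reporting the pair as one "move". The function name diff_with_moves and the tuple format of the operations are made up for this illustration.

```python
import difflib

def diff_with_moves(a, b):
    """Return edit operations on line lists a and b.

    A deleted block that reappears verbatim as an inserted block is
    reported once as ("move", a_start, a_end, b_start) instead of as a
    separate delete and insert. Only exact whole-block matches are paired.
    """
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    deletes, inserts = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deletes.append((i1, i2))
        if tag in ("insert", "replace"):
            inserts.append((j1, j2))

    ops, used = [], set()
    for i1, i2 in deletes:
        # Look for an unused inserted block with exactly the same lines.
        k = next((k for k, (j1, j2) in enumerate(inserts)
                  if k not in used and a[i1:i2] == b[j1:j2]), None)
        if k is None:
            ops.append(("delete", i1, i2))
        else:
            used.add(k)
            ops.append(("move", i1, i2, inserts[k][0]))
    # Inserted blocks that were not paired with a delete stay as inserts.
    for k, (j1, j2) in enumerate(inserts):
        if k not in used:
            ops.append(("insert", j1, j2))
    return ops

a = ["1", "2", "3", "4", "5", "6"]
b = ["4", "5", "6", "1", "2", "3"]
# Reports one "move" op rather than a delete plus an insert.
print(diff_with_moves(a, b))
```

Note that this only catches blocks that moved unchanged and that the underlying diff happened to split at exactly the right boundaries, which is why a dedicated algorithm is the real question here.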

Is there a variant of the diff algorithm that (a) retains diff's ability to represent insertions and deletions, and (b) efficiently represents moved blocks of text without having to show their entire contents?


asked Apr 08 '12 20:04

Keith Thompson

1 Answer

Since you asked for an algorithm and not an application, take a look at "The String-to-String Correction Problem with Block Moves" by Walter Tichy. There are others, but that's the original, so you can look for papers that cite it to find more.
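To give a feel for the block-move idea, here is a rough sketch in the spirit of a greedy approach: walk the target left to right and, at each position, copy the longest run that can be found anywhere in the source, falling back to a literal add when nothing matches. This is a naive quadratic search written for illustration, not the paper's implementation; the names block_moves and apply_ops and the ("copy", ...) / ("add", ...) tuples are assumptions of this sketch.

```python
def block_moves(source, target):
    """Cover `target` with blocks copied from `source` plus literal adds.

    Returns a list of ("copy", source_start, length) and ("add", item).
    """
    ops = []
    q = 0  # current position in target
    while q < len(target):
        best_p, best_len = -1, 0
        # Naive search: try every source position, keep the longest match.
        for p in range(len(source)):
            n = 0
            while (p + n < len(source) and q + n < len(target)
                   and source[p + n] == target[q + n]):
                n += 1
            if n > best_len:
                best_p, best_len = p, n
        if best_len:
            ops.append(("copy", best_p, best_len))
            q += best_len
        else:
            ops.append(("add", target[q]))
            q += 1
    return ops

def apply_ops(source, ops):
    """Rebuild the target from the source and the operation list."""
    out = []
    for op in ops:
        if op[0] == "copy":
            _, p, n = op
            out.extend(source[p:p + n])
        else:
            out.append(op[1])
    return out

old = ["1", "2", "3", "4", "5", "6"]
new = ["4", "5", "6", "1", "2", "3"]
ops = block_moves(old, new)
print(ops)  # [('copy', 3, 3), ('copy', 0, 3)]
assert apply_ops(old, ops) == new
```

For the example in the question, the whole change comes out as two copy operations that reference line ranges, so the moved lines themselves never have to be repeated.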

The paper cites Paul Heckel's paper "A technique for isolating differences between files" (mentioned in this answer to this question) and mentions this about its algorithm:

"Heckel[3] pointed out similar problems with LCS techniques and proposed a linear-time algorithm to detect block moves. The algorithm performs adequately if there are few duplicate symbols in the strings. However, the algorithm gives poor results otherwise. For example, given the two strings aabb and bbaa, Heckel's algorithm fails to discover any common substring."
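The aabb/bbaa failure is easy to reproduce with a simplified sketch of the unique-line anchoring idea: lines that occur exactly once in each file are paired up as anchors, and matches are then extended to neighbouring lines. This is my own condensed version for illustration, not code from either paper; heckel_matches and its return format are hypothetical.

```python
from collections import Counter

def heckel_matches(old, new):
    """Return {new_index: old_index} found by unique-line anchoring."""
    old_count, new_count = Counter(old), Counter(new)
    old_pos = {line: i for i, line in enumerate(old)}

    # Anchor every line that occurs exactly once in both sequences.
    match = {j: old_pos[line] for j, line in enumerate(new)
             if old_count[line] == 1 and new_count[line] == 1}
    used = set(match.values())

    # Extend matches forward from already matched lines.
    for j in range(len(new) - 1):
        if j in match and j + 1 not in match:
            i = match[j] + 1
            if i < len(old) and i not in used and old[i] == new[j + 1]:
                match[j + 1] = i
                used.add(i)

    # Extend matches backward from already matched lines.
    for j in range(len(new) - 1, 0, -1):
        if j in match and j - 1 not in match:
            i = match[j] - 1
            if i >= 0 and i not in used and old[i] == new[j - 1]:
                match[j - 1] = i
                used.add(i)
    return match

# No symbol is unique, so there are no anchors and nothing is matched.
print(heckel_matches(list("aabb"), list("bbaa")))      # {}
# Every line is unique, so the moved block is found immediately.
print(heckel_matches(list("123456"), list("456123")))
```

With no unique symbols there is nothing to anchor on, so the extension passes never start, which matches the behaviour the quote describes.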


answered Sep 24 '22 15:09

Zoë Peterson
