
Diffing more quickly

Tags: algorithm, diff

I'm working on diffing large binary files. I've implemented the celebrated Myers diff algorithm, which produces a minimal diff. However, it runs in O(ND) time, where N is the input length and D is the size of the diff, so for two very different 1 MB files I expect roughly 1 million squared = 1 trillion steps. That's not good!
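
Spelling out that estimate (a back-of-the-envelope sketch, taking D to be on the order of N since the files share almost nothing):

    # Myers runs in O(N*D): N = input length, D = size of the diff.
    # For two very different ~1 MB files, D is on the order of N.
    N = 1_000_000
    D = N
    print(f"{N * D:.0e} steps")  # 1e+12 -- the "1 trillion" above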

What I'd like is an algorithm that produces a potentially non-minimal diff, but does it much faster. I know that one must exist, because Beyond Compare does it. But I don't know how!

To be sure: there are tools like xdelta or bdiff, but these produce a patch meant for computer consumption, which is different from a human-consumable diff. A patch is concerned with transforming one file into another, so it can do things like copying from previous parts of the file. A human-consumable diff is there to visually show the differences, and can only insert and delete. For example, this transform:

"puddi" -> "puddipuddipuddi"

would produce a small patch of "copy [0,4] to [5,9] and to [10,14]", but a larger diff of "append 'puddipuddi'". I'm interested in algorithms that produce the larger diff.
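
To make the insert/delete-only behaviour concrete, here is that example run through Python's standard difflib (illustration only; it would be far too slow for large binary files):

    from difflib import SequenceMatcher

    a, b = "puddi", "puddipuddipuddi"
    # get_opcodes() describes b in terms of a using only
    # equal/insert/delete/replace -- no copies from elsewhere in the file.
    for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        print(op, repr(a[i1:i2]), "->", repr(b[j1:j2]))
    # equal 'puddi' -> 'puddi'
    # insert '' -> 'puddipuddi'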

Thanks!

asked Jan 06 '11 by fish

2 Answers

Diffing is basically the same algorithm as is used in bioinformatics to align DNA sequences. These sequences are often large (millions or billions of nucleotides long), and one strategy that scales well to long genomes is the one used by the program MUMmer:

  1. Quickly find all Maximal Unique Matches (MUMs: substrings that appear exactly once in each file and which cannot be extended in either direction with that condition still holding) using a suffix tree
  2. Quickly find the longest subset of MUMs that appear in consecutive order in both files using a longest-increasing-subsequence dynamic programming algorithm (sketched in code after this list)
  3. Fix this subset of MUMs in the alignment (i.e. mark those regions as matching)
  4. If deemed necessary, perform slower (e.g. Myers) diffing on the inter-MUM regions. In your case, you would probably omit this step entirely if you found the length of the longest MUM was beneath some threshold (which you would take to be evidence that the 2 files are unrelated).
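
To make steps 1 and 2 concrete, here is a rough Python sketch. It is not MUMmer's actual code: it substitutes unique fixed-length k-mers for true MUMs (MUMmer finds maximal matches with a suffix tree in linear time), and the function names and the choice of k are mine:

    from bisect import bisect_left
    from collections import Counter

    def unique_kmer_anchors(a: bytes, b: bytes, k: int = 16):
        """Step 1, simplified: k-mers occurring exactly once in each
        input (a stand-in for true Maximal Unique Matches)."""
        count_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
        count_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
        pos_a = {a[i:i + k]: i for i in range(len(a) - k + 1)}
        pos_b = {b[i:i + k]: i for i in range(len(b) - k + 1)}
        shared = [s for s in pos_a
                  if count_a[s] == 1 and count_b.get(s, 0) == 1]
        # Each anchor is (position in a, position in b), sorted by a-position.
        return sorted((pos_a[s], pos_b[s]) for s in shared)

    def chain_anchors(anchors):
        """Step 2: longest chain of anchors increasing in both files --
        a longest-increasing-subsequence over b-positions, O(m log m)."""
        tails_b, tails_i = [], []   # smallest ending b-pos per chain length
        parent = [-1] * len(anchors)
        for i, (_, b_pos) in enumerate(anchors):
            l = bisect_left(tails_b, b_pos)
            if l == len(tails_b):
                tails_b.append(b_pos); tails_i.append(i)
            else:
                tails_b[l] = b_pos; tails_i[l] = i
            parent[i] = tails_i[l - 1] if l else -1
        # Backtrack from the end of the longest chain.
        chain, i = [], (tails_i[-1] if tails_i else -1)
        while i != -1:
            chain.append(anchors[i]); i = parent[i]
        return chain[::-1]

The chained anchors are the regions you mark as matching in step 3; step 4 (if used) then runs Myers only on the short gaps between consecutive anchors.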

This tends to give a very good (though not guaranteed-optimal) set of aligned regions (or equivalently, a very small set of differences) whenever there are not too many differences. I'm not certain of the exact time bounds for each step, but I know that there are no n^2 or higher terms.

I believe the MUMmer program requires DNA or protein sequences, so it may not work out of the box for you, but the concepts certainly apply to general strings (e.g. files), so if you're prepared to reimplement it yourself, I would recommend this approach.

answered Nov 10 '22 by j_random_hacker


From a performance standpoint as file size grows, GNU Diffutils is probably the most robust option. For your situation I'd use its side-by-side comparison format, which is probably the most human-friendly of the lot. Otherwise you're stuck taking its output in another format and doing some work to make it pretty.
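
For instance, driven from a script (the filenames here are placeholders, and diff expects text, so binary inputs would need a textual dump such as a hex dump first):

    import subprocess

    # --side-by-side (-y) prints the two files in parallel columns;
    # --suppress-common-lines keeps only the rows that differ.
    result = subprocess.run(
        ["diff", "--side-by-side", "--suppress-common-lines",
         "old.txt", "new.txt"],
        capture_output=True, text=True,
    )
    print(result.stdout)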

A good contender whose performance has been improving steadily is diff-match-patch. It implements the Myers diff algorithm in several languages, including Java and JavaScript. See the online demo for an example of the latter with pretty-printed results. If you want to do line diffing, study the wiki for tips on how to use it for that purpose (a sketch follows below).
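
For line mode with the library's Python port (pip install diff-match-patch), the wiki's recipe looks roughly like this; the two input strings are just placeholders:

    from diff_match_patch import diff_match_patch

    dmp = diff_match_patch()

    def line_diff(text1: str, text2: str):
        # The wiki's line-mode trick: map each line to one character,
        # diff the compressed strings, then expand back to lines.
        chars1, chars2, line_array = dmp.diff_linesToChars(text1, text2)
        diffs = dmp.diff_main(chars1, chars2, False)
        dmp.diff_charsToLines(diffs, line_array)
        return diffs  # list of (op, text): -1 delete, 0 equal, 1 insert

    print(line_diff("alpha\nbravo\n", "alpha\ncharlie\n"))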

answered Nov 10 '22 by orangepips