I'm working on diffing large binary files. I've implemented the celebrated Myers Diff algorithm, which produces a minimal diff. However, it is O(ND), so to diff two very different 1 MB files, I expect to take time 1 million squared = 1 trillion. That's not good!
What I'd like is an algorithm that produces a potentially non-minimal diff, but does it much faster. I know that one must exist, because Beyond Compare does it. But I don't know how!
To be sure: There are tools like xdelta or bdiff, but these produce a patch meant for computer consumption, which is different than a human-consumable diff. A patch is concerned with transforming one file into another, so it can do things like copying from previous parts of the file. A human-consumable diff is there to visually show the differences, and can only insert and delete. For example, this transform:
"puddi" -> "puddipuddipuddi"
would produce a small patch of "copy [0,4] to [5,9] and to [10, 14]", but a larger diff of "append 'puddipuddi'". I'm interested in algorithms that produce the larger diff.
Thanks!
Diffing is basically the same algorithm as is used in bioinformatics to align DNA sequences. These sequences are often large (millions or billions of nucleotides long), and one strategy that works well there on longer genomes is used by the program MUMmer:
This tends to give a very good (though not guaranteed-optimal) set of aligned regions (or equivalently, a very small set of differences) whenever there are not too many differences. I'm not certain of the exact time bounds for each step, but I know that there are no n^2
or higher terms.
I believe the MUMmer program requires DNA or protein sequences, so it may not work out of the box for you, but the concepts certainly apply to general strings (e.g. files) so if you're prepared to reimplement it yourself I would recommend this approach.
From a performance standpoint as file size grows, GNU Diffutils is probably the most robust option. For your situation I'd probably use it's side-by-side comparison format, which is probably the most human friendly of the lot. Elsewise you're off taking its output in another format and doing some work to make it pretty .
A good contender, whose performance has been improving steadily, including numerous speedups, is diff-match-patch. It implements the Myers Diff algorithm in several different languages including Java and JavaScript. See the online demo for an example of the latter with pretty printed results. If you want to do line diffing study the wiki for tips there on how to use it for that purpose.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With