Can diff be beaten at its own game?

I'm looking for the appropriate algorithm to use to compare two files. I think I can do better than diff due to some added constraints.

What I have are two text files each containing a list of files. They are snapshots of all the files on a system taken at two different times. I want to figure out which files have been added or deleted between the two snapshots.

I could use diff to compare these files, but I don't want to because:

  1. diff tries to group changes together, finding which chunks in a file have changed. I'm only looking for a list of lines that have changed, and that should be a much simpler problem than finding the longest-common-subsequence or some such thing.

  2. Generalized diff algorithms are O(mn) in runtime or space. I'm looking for something more like O(m+n) in time and O(1) in space.

Here are the constraints on the problem:

  1. The file listings are in the same order in both files. They are not necessarily in alphabetical order, but they are in the same relative order.

  2. Most of the time there will be no differences between the lists. If there are differences, there will usually only be a handful of new/deleted files.

  3. I don't need to group the results together, like saying "this entire directory was deleted" or "lines 100-200 are new". I can individually list each line that is different.

I'm thinking this is equivalent to the problem of having two sorted lists and trying to figure out the differences between the two lists. The hitch is the list items aren't necessarily sorted alphabetically, so you don't know if one item is "greater" than another. You just know that the files that are present in both lists will be in the same order.

For what it's worth, I previously posted this question on Ask Metafilter several years ago. Allow me to respond to several potential answers upfront.

Answer: This problem is called Longest Common Subsequence.

Response: I'm trying to avoid the longest common subsequence because simple algorithms run in O(mn) time/space and better ones are complicated and more "heuristical". My intuition tells me that there is a linear-time algorithm due to the added constraints.

Answer: Sort them alphabetically and then compare.

Response: That would be O(m log m + n log n), which is worse than O(m+n).

asked Jun 20 '09 by John Kugelman




3 Answers

Read one file, placing each file-name into a HashSet-like data structure with O(1) add and O(1) contains implementations.

Then read the second file, checking each file-name against the HashSet.

If the first file has m entries and the second has n, the total running time is O(m+n), as required.
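
A minimal Python sketch of this approach, assuming both listings fit in memory; the reverse pass that reports deletions and the placeholder file names are additions of mine, since the answer itself only describes the forward check:

def snapshot_diff(old_listing, new_listing):
    # Pass 1: O(m) -- load every old file name into a set (O(1) average lookups).
    with open(old_listing) as f:
        old = set(line.rstrip("\n") for line in f)

    # Pass 2: O(n) -- anything in the new listing that is not in `old` was added.
    added, seen = [], set()
    with open(new_listing) as f:
        for line in f:
            name = line.rstrip("\n")
            seen.add(name)
            if name not in old:
                added.append(name)

    # Anything in the old listing that never appeared in the new one was deleted.
    deleted = [name for name in old if name not in seen]
    return added, deleted

added, deleted = snapshot_diff("filelist1", "filelist2")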

Note: for this to be fast, the algorithm assumes the data set fits comfortably in physical memory.

If the data set cannot easily fit in memory, the lookup could be implemented with some variation of a B-tree with disk paging. The complexity would then be O(m log m) to set up the tree initially and O(n log m) to compare each further file listing against it.

answered by Ben S


From a theoretical point of view, computing the edit distance between two strings (and here you have strings in a funny language where a 'character' is a whole file name) cannot be done in O(m+n) in general. But here we have simplifications.

An implementation of such an algorithm for your case (it may well still contain mistakes):

# i[0], i[1] are "undoable" iterators: next() returns the next line, or None at
# the end of the stream; undo(x) pushes x back so the following next() returns it.

def diff(i):
    while True:
        a, b = i[0].next(), i[1].next()   # read one item from each stream
        if a is None or b is None:
            break
        if a == b:                        # identical lines: nothing to report
            continue
        c = [[a], [b]]                    # buffers of unmatched lines, one per stream
        w = 1
        while True:                       # read from one stream at a time
            nx = i[1 - w].next()
            if nx is None:                # stream 1-w ended: nothing in c[w] can match now
                for cc in c[w]: yield cc
                for cc in reversed(c[1 - w]): i[1 - w].undo(cc)    # may match later lines of stream w
                break
            if nx in c[w]:                # found the resynchronization point
                nxi = c[w].index(nx)
                for cc in c[1 - w]: yield cc                       # lines seen only in stream 1-w
                for cc in c[w][:nxi]: yield cc                     # lines seen only in stream w
                for cc in reversed(c[w][nxi + 1:]): i[w].undo(cc)  # read ahead past the match: put it back
                break                     # and return to the lock-step scan
            c[1 - w].append(nx)           # no match yet: remember it
            w = 1 - w                     # and switch to the other stream
    # one stream is finished; everything left in the other is a difference
    if a is not None: yield a
    if b is not None: yield b
    for ci in i:
        while (cc := ci.next()) is not None:
            yield cc

The buffers c[0] and c[1] are what I call fast arrays -- essentially hash sets that also remember insertion order. Addition and lookup in them should be O(log N), with O(N) memory. (The sketch above uses plain Python lists for simplicity, which makes each membership test linear in the size of a difference block.)

This doesn't use any memory or cycles beyond O(m+n) outside of finding differences. For every 'difference block' -- an edit that removes M consecutive items and adds N new ones -- it takes O(M+N) memory and roughly O(M log N + N log M) instructions (with fast arrays; O(MN) with the plain lists above). The memory is released once a block is resolved, so this costs little if you really do have only small changes. Of course, the worst-case performance is as bad as with a generic method.
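
For completeness, here is one way the "undoable" iterator assumed at the top of the sketch could look, together with a tiny usage example; the class name and interface are illustrative rather than part of the original answer:

class Undoable:
    # Wraps an iterator: next() returns the next item, or None at the end of the
    # stream; undo(x) pushes x back so that the following next() returns it again.
    def __init__(self, iterable):
        self.it = iter(iterable)
        self.stack = []                   # pushed-back items, returned first

    def next(self):
        if self.stack:
            return self.stack.pop()
        return next(self.it, None)        # None signals end of stream

    def undo(self, item):
        self.stack.append(item)

old = Undoable(["a", "b", "c", "d"])
new = Undoable(["a", "c", "d", "e"])
print(list(diff([old, new])))             # -> ['b', 'e']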

answered by ilya n.


In practice, the extra log factor in the running time is probably insignificant -- sort can handle hundreds of thousands of lines in a few seconds. So you don't actually need to write any code:

sort filelist1 > filelist1.sorted
sort filelist2 > filelist2.sorted
comm -3 filelist1.sorted filelist2.sorted > changes
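
If you also want to know which side each line came from, comm can split the two cases: -23 keeps only the lines unique to the first file and -13 only those unique to the second (assuming filelist1 is the earlier snapshot):

comm -23 filelist1.sorted filelist2.sorted > deleted
comm -13 filelist1.sorted filelist2.sorted > added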

I'm not claiming that this is necessarily the fastest solution -- I think Ben S's accepted answer will be, at least above some value of N. But it's definitely the simplest, it will scale to any number of files, and (unless you are the guy in charge of Google's backup operation) it will be more than fast enough for the number of files you have.

answered by j_random_hacker