I'm looking for the appropriate algorithm to use to compare two files. I think I can do better than diff due to some added constraints.
What I have are two text files each containing a list of files. They are snapshots of all the files on a system taken at two different times. I want to figure out which files have been added or deleted between the two snapshots.
I could use diff to compare these files, but I don't want to because:
diff tries to group changes together, finding which chunks in a file have changed. I'm only looking for a list of lines that have changed, and that should be a much simpler problem than finding the longest common subsequence or some such thing.
Generalized diff algorithms are O(mn) in runtime or space. I'm looking for something more like O(m+n) in time and O(1) in space.
Here are the constraints on the problem:
The file listings are in the same order in both files. They are not necessarily in alphabetical order, but they are in the same relative order.
Most of the time there will be no differences between the lists. If there are differences, there will usually only be a handful of new/deleted files.
I don't need to group the results together, like saying "this entire directory was deleted" or "lines 100-200 are new". I can individually list each line that is different.
I'm thinking this is equivalent to the problem of having two sorted lists and trying to figure out the differences between the two lists. The hitch is the list items aren't necessarily sorted alphabetically, so you don't know if one item is "greater" than another. You just know that the files that are present in both lists will be in the same order.
For what it's worth, I previously posted this question on Ask Metafilter several years ago. Allow me to respond to several potential answers upfront.
Answer: This problem is called Longest Common Subsequence.
Response: I'm trying to avoid the longest common subsequence because simple algorithms run in O(mn) time/space and better ones are complicated and more "heuristical". My intuition tells me that there is a linear-time algorithm due to the added constraints.
Answer: Sort them alphabetically and then compare.
Response: That would be O(m log m+n log n), which is worse than O(m+n).
Read one file, placing each file-name into a HashSet-like data structure with O(1) add and O(1) contains implementations. Then read the second file, checking each file-name against the HashSet.
The total algorithm, if file one has length m and the second file has length n, is O(m+n) as required.
Note: This algorithm assumes the data set fits comfortably in physical memory in order to be fast.
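As a concrete illustration, here is a minimal sketch of this approach in Python, using the built-in set type as the HashSet. The function and argument names are placeholders; matched names are removed from the set so that whatever is left at the end is the list of deleted files.

import sys

def snapshot_diff(old_listing, new_listing):
    # Build a set of every file name in the first snapshot: O(m) time, O(m) space.
    with open(old_listing) as f:
        old_names = {line.rstrip("\n") for line in f}

    # Scan the second snapshot: O(n) lines, each with an O(1) membership test.
    added = []
    with open(new_listing) as f:
        for line in f:
            name = line.rstrip("\n")
            if name in old_names:
                old_names.discard(name)   # present in both snapshots
            else:
                added.append(name)        # only in the new snapshot

    # Anything never matched was present only in the old snapshot.
    return added, sorted(old_names)

if __name__ == "__main__":
    added, deleted = snapshot_diff(sys.argv[1], sys.argv[2])
    for name in added:
        print("added:   " + name)
    for name in deleted:
        print("deleted: " + name)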
If the data set cannot easily fit in memory, the lookup could be implemented using some variation of a B-Tree with disk paging. The complexity would then be O(m log m) to set up initially and O(n log m) to compare each other file against it.
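If the listings really don't fit in memory, a rough sketch of the on-disk variant could use Python's standard dbm module as a stand-in for the paged lookup structure (dbm is hash-based rather than a B-Tree, so the log factors above are only indicative). The names here are again placeholders.

import dbm

def snapshot_diff_on_disk(old_listing, new_listing, index_path="old_names.db"):
    # Build an on-disk index of the first snapshot instead of an in-memory set.
    with dbm.open(index_path, "n") as index, open(old_listing) as f:
        for line in f:
            index[line.rstrip("\n").encode()] = b"1"

    # Scan the second snapshot, deleting every name that appears in both.
    added = []
    with dbm.open(index_path, "w") as index, open(new_listing) as f:
        for line in f:
            name = line.rstrip("\n").encode()
            if name in index:
                del index[name]              # present in both snapshots
            else:
                added.append(name.decode())  # only in the new snapshot
        deleted = [key.decode() for key in index.keys()]
    return added, deleted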
From a theoretical point of view, computing the edit distance between two strings (because here you have strings over a funny alphabet where a 'character' is a file name) cannot be done in O(m+n). But here we have simplifications.
A sketch of such an algorithm for your case, in Python (it may still contain mistakes):
# i[0], i[1] are "undoable" iterators over the two listings: next() returns
# None once a stream is exhausted, and undo(x) pushes x back so that it is
# the next item returned by next().  This is the body of a generator that
# yields every line present in only one of the two listings.
while True:
    a, b = i[0].next(), i[1].next()      # read one item from each stream
    if a is None or b is None:
        break
    if a == b:                           # identical lines need no work
        continue
    c = [[a], [b]]                       # two 'fast arrays' buffering the diverging items, one per stream
    w = 1
    while True:                          # read from one stream at a time
        nx = i[1 - w].next()
        if nx is None:                   # stream 1-w ended: nothing more can match
            nxi = len(c[w])
        elif nx in c[w]:                 # streams resynchronised on a common item
            nxi = c[w].index(nx)
        else:                            # still diverged: buffer it and switch streams
            c[1 - w].append(nx)
            w = 1 - w
            continue
        for cc in c[1 - w]: yield cc     # everything buffered from stream 1-w differs
        for cc in c[w][:nxi]: yield cc   # as does everything before the match point
        for cc in reversed(c[w][nxi + 1:]):  # items read past the match point are unread:
            i[w].undo(cc)                    # push them back, most recently read first
        break                            # and return to the lock-step loop
# at least one stream finished; anything already read from the other differs
if a is not None: yield a
if b is not None: yield b
for ci in i:                             # and so does whatever remains in either stream
    while (cc := ci.next()) is not None:
        yield cc
There are data structures that I call fast arrays -- they are probably HashSet-like things, but ones that remember ordering. Addition and lookup in them should be O(log N), but the memory use is O(N).
This doesn't use any memory or cycles beyond O(m+n) outside of finding differences. For every 'difference block' -- the operation that can be described as taking away M consecutive items and adding N new ones -- this takes O(M+N) memory and O(M log N + N log M) instructions with the fast arrays (O(MN) with plain lists). The memory is released after a block is done, so this isn't much of an issue if you indeed only have small changes. Of course, the worst-case performance is as bad as with the generic method.
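The "undoable" iterators the sketch relies on are not spelled out above; a minimal wrapper with that behaviour (the Undoable name and interface are my own, not part of the original answer) could look like this:

from collections import deque

class Undoable:
    """Iterator wrapper with push-back; next() returns None once the stream ends."""
    def __init__(self, iterable):
        self._it = iter(iterable)
        self._pushed = deque()            # items that have been undone

    def next(self):
        if self._pushed:
            return self._pushed.popleft() # hand back undone items first, in order
        return next(self._it, None)

    def undo(self, item):
        self._pushed.appendleft(item)     # item becomes the next thing returned

# e.g. i = (Undoable(open("filelist1.txt")), Undoable(open("filelist2.txt")))

Because the loop above pushes items back most-recently-read first, undo() places each item at the front of the queue, so pushed-back items come out again in their original stream order.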
In practice, a log-factor difference in sorting times is probably insignificant -- sort can sort hundreds of thousands of lines in a few seconds. So you don't actually need to write any code:
sort filelist1 > filelist1.sorted
sort filelist2 > filelist2.sorted
comm -3 filelist1.sorted filelist2.sorted > changes
I'm not claiming that this is necessarily the fastest solution -- I think Ben S's accepted answer will be, at least above some value of N. But it's definitely the simplest, it will scale to any number of files, and (unless you are the guy in charge of Google's backup operation) it will be more than fast enough for the number of files you have.