Can diff be beaten at its own game?

I'm looking for the appropriate algorithm to use to compare two files. I think I can do better than diff due to some added constraints.

What I have are two text files each containing a list of files. They are snapshots of all the files on a system taken at two different times. I want to figure out which files have been added or deleted between the two snapshots.

I could use diff to compare these files, but I don't want to because:

  1. diff tries to group changes together, finding which chunks in a file have changed. I'm only looking for a list of lines that have changed, and that should be a much simpler problem than finding the longest-common-subsequence or some such thing.

  2. Generalized diff algorithms are O(mn) in runtime or space. I'm looking for something more like O(m+n) in time and O(1) in space.

Here are the constraints on the problem:

  1. The file listings are in the same order in both files. They are not necessarily in alphabetical order, but they are in the same relative order.

  2. Most of the time there will be no differences between the lists. If there are differences, there will usually only be a handful of new/deleted files.

  3. I don't need to group the results together, like saying "this entire directory was deleted" or "lines 100-200 are new". I can individually list each line that is different.

I'm thinking this is equivalent to the problem of having two sorted lists and trying to figure out the differences between the two lists. The hitch is the list items aren't necessarily sorted alphabetically, so you don't know if one item is "greater" than another. You just know that the files that are present in both lists will be in the same order.

For what it's worth, I previously posted this question on Ask Metafilter several years ago. Allow me to respond to several potential answers upfront.

Answer: This problem is called Longest Common Subsequence.

Response: I'm trying to avoid the longest common subsequence because simple algorithms run in O(mn) time/space and better ones are complicated and more "heuristical". My intuition tells me that there is a linear-time algorithm due to the added constraints.

Answer: Sort them alphabetically and then compare.

Response: That would be O(m log m + n log n), which is worse than O(m+n).

asked Jun 20 '09 by John Kugelman




3 Answers

Read one file, placing each file-name into a HashSet-like data structure with O(1) add and O(1) contains implementations.

Then read the second file, checking each file-name against the HashSet.

If the first file has m entries and the second has n, the total running time is O(m+n), as required.
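
A minimal Python sketch of this approach, assuming both listings fit in memory; the reverse pass that reports deletions and the placeholder file names are additions of mine, since the answer itself only describes the forward check:

def snapshot_diff(old_listing, new_listing):
    # Pass 1: O(m) -- load every old file name into a set (O(1) average lookups).
    with open(old_listing) as f:
        old = set(line.rstrip("\n") for line in f)

    # Pass 2: O(n) -- anything in the new listing that is not in `old` was added.
    added, seen = [], set()
    with open(new_listing) as f:
        for line in f:
            name = line.rstrip("\n")
            seen.add(name)
            if name not in old:
                added.append(name)

    # Anything in the old listing that never appeared in the new one was deleted.
    deleted = [name for name in old if name not in seen]
    return added, deleted

added, deleted = snapshot_diff("filelist1", "filelist2")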

Note: for this to be fast, the algorithm assumes the data set fits comfortably in physical memory.

If the data set cannot easily fit in memory, the lookup could be implemented with some variation of a B-tree with disk paging. The complexity would then be O(m log m) to set up the tree initially and O(n log m) to compare each further file listing against it.

answered by Ben S


From a theoretical point of view, computing the edit distance between two strings (and here you have strings in a funny language where a 'character' is a whole file name) cannot be done in O(m+n) in general. But here we have simplifications.

An implementation of such an algorithm for your case (it may well still contain mistakes):

# i[0], i[1] are "undoable" iterators: next() returns the next line, or None at
# the end of the stream; undo(x) pushes x back so the following next() returns it.

def diff(i):
    while True:
        a, b = i[0].next(), i[1].next()   # read one item from each stream
        if a is None or b is None:
            break
        if a == b:                        # identical lines: nothing to report
            continue
        c = [[a], [b]]                    # buffers of unmatched lines, one per stream
        w = 1
        while True:                       # read from one stream at a time
            nx = i[1 - w].next()
            if nx is None:                # stream 1-w ended: nothing in c[w] can match now
                for cc in c[w]: yield cc
                for cc in reversed(c[1 - w]): i[1 - w].undo(cc)    # may match later lines of stream w
                break
            if nx in c[w]:                # found the resynchronization point
                nxi = c[w].index(nx)
                for cc in c[1 - w]: yield cc                       # lines seen only in stream 1-w
                for cc in c[w][:nxi]: yield cc                     # lines seen only in stream w
                for cc in reversed(c[w][nxi + 1:]): i[w].undo(cc)  # read ahead past the match: put it back
                break                     # and return to the lock-step scan
            c[1 - w].append(nx)           # no match yet: remember it
            w = 1 - w                     # and switch to the other stream
    # one stream is finished; everything left in the other is a difference
    if a is not None: yield a
    if b is not None: yield b
    for ci in i:
        while (cc := ci.next()) is not None:
            yield cc

The buffers c[0] and c[1] are what I call fast arrays -- essentially hash sets that also remember insertion order. Addition and lookup in them should be O(log N), with O(N) memory. (The sketch above uses plain Python lists for simplicity, which makes each membership test linear in the size of a difference block.)

This doesn't use any memory or cycles beyond O(m+n) outside of finding differences. For every 'difference block' -- an edit that removes M consecutive items and adds N new ones -- it takes O(M+N) memory and roughly O(M log N + N log M) instructions (with fast arrays; O(MN) with the plain lists above). The memory is released once a block is resolved, so this costs little if you really do have only small changes. Of course, the worst-case performance is as bad as with a generic method.
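
For completeness, here is one way the "undoable" iterator assumed at the top of the sketch could look, together with a tiny usage example; the class name and interface are illustrative rather than part of the original answer:

class Undoable:
    # Wraps an iterator: next() returns the next item, or None at the end of the
    # stream; undo(x) pushes x back so that the following next() returns it again.
    def __init__(self, iterable):
        self.it = iter(iterable)
        self.stack = []                   # pushed-back items, returned first

    def next(self):
        if self.stack:
            return self.stack.pop()
        return next(self.it, None)        # None signals end of stream

    def undo(self, item):
        self.stack.append(item)

old = Undoable(["a", "b", "c", "d"])
new = Undoable(["a", "c", "d", "e"])
print(list(diff([old, new])))             # -> ['b', 'e']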

answered by ilya n.


In practice, the extra log factor in the running time is probably insignificant -- sort can handle hundreds of thousands of lines in a few seconds. So you don't actually need to write any code:

sort filelist1 > filelist1.sorted
sort filelist2 > filelist2.sorted
comm -3 filelist1.sorted filelist2.sorted > changes
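
If you also want to know which side each line came from, comm can split the two cases: -23 keeps only the lines unique to the first file and -13 only those unique to the second (assuming filelist1 is the earlier snapshot):

comm -23 filelist1.sorted filelist2.sorted > deleted
comm -13 filelist1.sorted filelist2.sorted > added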

I'm not claiming that this is necessarily the fastest solution -- I think Ben S's accepted answer will be, at least above some value of N. But it's definitely the simplest, it will scale to any number of files, and (unless you are the guy in charge of Google's backup operation) it will be more than fast enough for the number of files you have.

answered by j_random_hacker