Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comparing two large files

I need to write a program that will write to a file the difference between two files. The program has to loop through a 600 MB file with over 13.464.448 lines, check if a grep returns true on another file and then write the result onto another file. I wrote a quick test with about 1.000.000 records and it took over an hour, so i'm guessing this approach could take 9+ hours.

Do you have any recommendations on how to make this faster? Any particular language i should use? I was planning on doing it in bash or python.

Thanks a lot in advance.

[EDIT 1]: Sorry, when i say difference between two files i did not mean a diff. The result file is in a different format.

The logic is a bit like this:

File A has 297.599 lines File B has over 13 million lines

I pick the current line being read from FILE A, grep it on FILE B, and if the line is present in File B i will write it to the result file. By the way, File A and File B have different formats. Result file will have the format of File A.

[EDIT 2]: I was asked at work to create a bash solution ideally so that we don't have to install python on all the machines this has to run on.

This is my curent implementation:

#!/bin/bash

LAST_TTP=`ls -ltr TTP_*.txt | tail -1 | awk '{ print $9 }'`
LAST_EXP=`ls -ltr *.SSMT | tail -1 | awk '{ print $9 }'`

while read -r line; do
   MATCH="$(grep $line $LAST_EXP)"
   echo "line: $line, match: $MATCH"

   # if not empty
   if [ ! -z "$MATCH" ]
   then
      echo $MATCH >> result
   fi

done < $LAST_TTP

This bash approach is taking over 10 hours to complete. Do you have any suggestions on how to make it more efficient in bash?

Thanks a lot in advance!

like image 993
user1155413 Avatar asked Nov 30 '22 06:11

user1155413


2 Answers

You're probably looking in a list instead of a set, leading to an O(n²) performance. Try:

with open('b') as b:
  blines = set(b)
with open('a') as a:
  with open('result', 'w') as result:
    for line in a:
      if line not in blines:
        result.write(line)

Assuming uniformly long (and not overly long lines), the performance of this implementation is in O(|A| + |B|) (amortized, due to Python's set being extremely fast). The memory demand is in O(|B|), but with a factor significantly greater than 1.

If the order of lines in the output does not matter, you can also sort both files and then compare them line-by-line. This will have a performance in the order of O(|A| log |A| + B log |B|). The memory demand will be in O(|A|+|B|), or more precisely, |A| + |B|.

like image 190
phihag Avatar answered Dec 04 '22 03:12

phihag


Sort each of the input files. Now read one line from each. If one line compares less than the other, output it as a difference and read the next line from that file. If both lines compare equal, read the next line from both files. If you read to the end of one file, all the lines in the other file are differences.

This is an O(n log n) algorithm as opposed to the O(n^2) algorithm you started with.

like image 39
Mark Ransom Avatar answered Dec 04 '22 04:12

Mark Ransom