
Quickly find differences between two large text files

I have two 3 GB text files, each with around 80 million lines, and they share 99.9% of their lines (file A has 60,000 unique lines, file B has 80,000 unique lines).

How can I quickly find the unique lines in the two files? Are there any ready-to-use command-line tools for this? I'm using Python, but I suspect it would be hard to find an efficient Pythonic way to load and compare the files.

Any suggestions are appreciated.

asked Aug 23 '10 by jack


2 Answers

If order matters, try the comm utility. If order doesn't matter, sort file1 file2 | uniq -u.
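As a concrete sketch of both suggestions (fileA.txt and fileB.txt are placeholder names), note that comm requires its inputs to be sorted, so sort each file first:

```shell
# comm requires sorted input, so sort both files first
sort fileA.txt -o fileA.sorted
sort fileB.txt -o fileB.sorted

# -3 suppresses lines common to both files:
# column 1 = lines only in fileA, column 2 = lines only in fileB
comm -3 fileA.sorted fileB.sorted

# if order doesn't matter, merge-sort both files and keep
# lines that appear exactly once across the pair
sort fileA.txt fileB.txt | uniq -u
```

One caveat with the uniq -u variant: it also discards a line that is duplicated within a single file, so it assumes each file's lines are unique.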

answered Oct 16 '22 by zwol


I think this is the fastest method (whether it's in Python or another language shouldn't matter too much IMO).

Notes:

1. I only store each line's hash, to save space (and time, if paging might occur).

2. Because of the above, I only print line numbers; if you need the actual lines, you'd have to read the files in again.

3. I assume the hash function produces no collisions. This is nearly, but not perfectly, certain.

4. I use hashlib because the built-in hash() function's output is too short to avoid collisions.

import sys
import hashlib

lines = []
for i in range(2):
    # maps each line's hash to "<filename>: <line number>"
    lines.append({})
    # open the files named on the command line
    with open(sys.argv[1 + i], 'r') as f:
        # assuming the default encoding is sufficient for the input files;
        # line numbers start at 1
        for counter, line in enumerate(f, start=1):
            hashcode = hashlib.sha512(line.encode()).hexdigest()
            lines[i][hashcode] = sys.argv[1 + i] + ': ' + str(counter)

unique0 = lines[0].keys() - lines[1].keys()
unique1 = lines[1].keys() - lines[0].keys()
result = [lines[0][x] for x in unique0] + [lines[1][x] for x in unique1]
print('\n'.join(result))
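A footnote on memory (my own sketch, not part of the original answer): for ~80 million lines, storing 128-character hex digests as dict keys costs on the order of 10 GB for the strings alone. Storing a short raw digest instead cuts that substantially; a 16-byte BLAKE2b digest still makes collisions vanishingly unlikely at this scale. The names line_digest and unique_line_numbers below are illustrative:

```python
import hashlib

def line_digest(line: str) -> bytes:
    # 16-byte BLAKE2b digest: collision odds are negligible for ~10^8 lines
    return hashlib.blake2b(line.encode(), digest_size=16).digest()

def unique_line_numbers(path_a, path_b):
    """Return sorted 1-based line numbers unique to each file, compared by hash."""
    tables = []
    for path in (path_a, path_b):
        table = {}
        with open(path, 'r') as f:
            for lineno, line in enumerate(f, start=1):
                table[line_digest(line)] = lineno
        tables.append(table)
    only_a = sorted(tables[0][h] for h in tables[0].keys() - tables[1].keys())
    only_b = sorted(tables[1][h] for h in tables[1].keys() - tables[0].keys())
    return only_a, only_b
```

Like the answer above, this keeps only one line number per distinct line, so a line repeated within one file is recorded once.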
answered Oct 16 '22 by max