Quickly find differences between two large text files

I have two 3 GB text files, each with around 80 million lines. They share 99.9% of their lines (file A has 60,000 unique lines, file B has 80,000 unique lines).

How can I quickly find the unique lines in the two files? Are there any ready-to-use command-line tools for this? I'm using Python, but I suspect there is no efficient Pythonic way to load the files and compare them.

Any suggestions are appreciated.

746

asked Aug 23 '10 02:08

jack

2 Answers

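If order matters, try the comm utility; note that it expects both files to be sorted. If order doesn't matter, run sort file1 file2 | uniq -u, which prints every line that occurs exactly once in the combined input.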
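
Since the question mentions Python, here is a rough sketch of the merge-style comparison that comm performs on sorted input. The helper name comm_unique and the "A"/"B" labels are made up for illustration, and the sketch assumes both inputs are already sorted in the order Python uses to compare strings. It is not a replacement for the real utility.

```python
def comm_unique(sorted_a, sorted_b):
    """Yield (side, line) for lines found in only one of two sorted inputs.

    Hypothetical helper, assuming both inputs are sorted the same way.
    Lines are paired one-to-one, so a line that appears twice in A and
    once in B is reported once as unique to A.
    """
    it_a, it_b = iter(sorted_a), iter(sorted_b)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None and b is not None:
        if a == b:
            # common line: advance both inputs
            a, b = next(it_a, None), next(it_b, None)
        elif a < b:
            # a sorts before anything left in B, so it is unique to A
            yield ("A", a)
            a = next(it_a, None)
        else:
            yield ("B", b)
            b = next(it_b, None)
    # whatever remains in either input has no partner
    while a is not None:
        yield ("A", a)
        a = next(it_a, None)
    while b is not None:
        yield ("B", b)
        b = next(it_b, None)
```

Because the function only ever holds one line from each input, it can be fed two open file objects directly and memory use stays flat regardless of file size. The expensive part is getting both files sorted first.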

answered Oct 16 '22 12:10

zwol

I think this is the fastest method (whether it's in Python or another language shouldn't matter too much IMO).

Notes:

1. I only store each line's hash to save space (and time, if paging might occur).

2. Because of the above, I only print out line numbers; if you need the actual lines, you'd need to read the files in again.

3. I assume that the hash function produces no collisions. This is nearly, but not perfectly, certain.

4. I import hashlib because the built-in hash() function produces values too short to rule out collisions.
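
To put rough numbers on notes 3 and 4, here is a small sketch of the standard union (birthday) bound on the chance that any two distinct lines collide. It assumes the hash behaves like a uniformly random function, that hash() is 64 bits wide (as on typical 64-bit builds), and that all of the roughly 160 million lines are distinct. The function name collision_bound is hypothetical.

```python
from fractions import Fraction

def collision_bound(n_items, hash_bits):
    """Upper bound on the probability that any two of n distinct items
    share a hash value: (number of pairs) / (number of hash values)."""
    pairs = n_items * (n_items - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)

n = 160_000_000  # about 80 million lines in each of the two files

bound_64 = float(collision_bound(n, 64))    # assumed width of built-in hash()
bound_512 = float(collision_bound(n, 512))  # SHA-512 from hashlib

print("64-bit hash bound: ", bound_64)
print("512-bit hash bound:", bound_512)
```

Under these assumptions the 64-bit bound comes out on the order of one in a thousand, which is small but not negligible for a one-off comparison, while the SHA-512 bound is vanishingly small.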

import sys
import hashlib

# lines[i] maps each line's hash to "<filename>: <line number>" for file i
lines = []
for path in sys.argv[1:3]:
    hashes = {}
    # read bytes so that no text encoding has to be assumed
    with open(path, 'rb') as f:
        # assuming you like counting lines starting with 1
        for counter, line in enumerate(f, start=1):
            # drop the line ending so \n and \r\n files compare equal
            hashcode = hashlib.sha512(line.rstrip(b'\r\n')).digest()
            # a line repeated within one file keeps only its last line number
            hashes[hashcode] = path + ': ' + str(counter)
    lines.append(hashes)

# hashes that occur in one file but not in the other
unique0 = lines[0].keys() - lines[1].keys()
unique1 = lines[1].keys() - lines[0].keys()
result = [lines[0][x] for x in unique0] + [lines[1][x] for x in unique1]
for entry in result:
    print(entry)
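
As note 2 says, getting the actual text back requires a second pass over the files. A minimal sketch of that pass follows, assuming you have already collected the 1-based line numbers of interest for one file. The function name fetch_lines is hypothetical, and decoding with errors='replace' is just one way to avoid failing on unexpected bytes.

```python
def fetch_lines(path, wanted):
    """Return {line_number: text} for the 1-based line numbers in `wanted`.

    Hypothetical helper: streams the file once and stops early when
    every requested line has been found.
    """
    wanted = set(wanted)
    found = {}
    with open(path, 'rb') as f:
        for number, line in enumerate(f, start=1):
            if number in wanted:
                # strip the line ending and decode for display
                found[number] = line.rstrip(b'\r\n').decode(errors='replace')
                if len(found) == len(wanted):
                    break
    return found
```

With only tens of thousands of unique lines per file, the returned dictionary stays small even though the scan itself still has to read through most of a 3 GB file.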

answered Oct 16 '22 14:10

max


Tags: python, file, text, compare, diff
