I have two 3GB text files, each with around 80 million lines. They share 99.9% of their lines (file A has 60,000 unique lines, file B has 80,000 unique lines).
How can I quickly find the lines that are unique to each file? Are there any ready-to-use command-line tools for this? I'm using Python, but I suspect there is no efficient Pythonic way to load files this large and compare them.
Any suggestions are appreciated.
You could try a command-line diff tool, or DiffUtils for Windows. TextPad also has a built-in comparison tool if the files are text. If you just need to determine whether the files are different (not what the differences are), use a checksum comparison tool that uses MD5 or SHA1.
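The checksum idea can be sketched in Python as follows. This is only a minimal sketch: the function names file_digest and files_identical and the 1 MiB chunk size are my own choices for illustration, not part of any tool mentioned above. It reads each file in chunks so a 3GB file never has to fit in memory.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b, algorithm="sha1"):
    """True if both files produce the same digest (hypothetical helper)."""
    return file_digest(path_a, algorithm) == file_digest(path_b, algorithm)
```

Calling files_identical("A.txt", "B.txt") (placeholder file names) only answers whether the files differ at all; it says nothing about which lines differ.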
For all file formats that Word can open, the Compare option in Word is the easiest to use.
Right-click on the first file. Click on “Select for Compare” from the menu. Then right-click on the second file and click on “Compare with Selected”.
Open the two files you want to compare (A and B) in Notepad++. File B (new) gets compared to File A (old). Then navigate to Plugins > Compare Menu > Compare. It shows the differences side by side.
If order matters, try the comm utility (note that comm expects both inputs to be sorted). If order doesn't matter, use sort file1 file2 | uniq -u.
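For comparison, here is a rough Python sketch of the order-independent approach using sets. The function name unique_lines is made up for this example, and the sketch assumes the distinct lines of both files fit in memory, which may not hold for 3GB inputs.

```python
def unique_lines(path_a, path_b):
    """Return (lines only in A, lines only in B), ignoring line order."""
    with open(path_a, "rb") as file_a:
        lines_a = {line.rstrip(b"\r\n") for line in file_a}
    with open(path_b, "rb") as file_b:
        lines_b = {line.rstrip(b"\r\n") for line in file_b}
    # Set difference keeps the lines that appear in one file but not the other
    return lines_a - lines_b, lines_b - lines_a
```

One difference from the shell pipeline: sort file1 file2 | uniq -u drops any line that is repeated within a single file, while the set version still reports it as long as the other file does not contain it.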
I think this is the fastest method (whether it's in Python or another language shouldn't matter too much IMO).
Notes:
1. I only store each line's hash to save space (and time, if paging might occur).
2. Because of the above, I only print out line numbers; if you need the actual lines, you'd just need to read the files in again.
3. I assume that the hash function results in no conflicts. This is nearly, but not perfectly, certain.
4. I import hashlib because the built-in hash() function is too short to avoid conflicts.
import sys
import hashlib

file = []
lines = []
for i in range(2):
    # open the files named in the command line
    file.append(open(sys.argv[1+i], 'r'))
    # stores the hash value and the line number for each line in file i
    lines.append({})
    # assuming you like counting lines starting with 1
    counter = 1
    while 1:
        # assuming default encoding is sufficient to handle the input file
        line = file[i].readline().encode()
        if not line: break
        hashcode = hashlib.sha512(line).hexdigest()
        lines[i][hashcode] = sys.argv[1+i]+': '+str(counter)
        counter += 1
    # done with this file
    file[i].close()

# hashes that occur in only one of the two files
unique0 = lines[0].keys() - lines[1].keys()
unique1 = lines[1].keys() - lines[0].keys()
result = [lines[0][x] for x in unique0] + [lines[1][x] for x in unique1]
# print "filename: line number" for every unique line
print('\n'.join(result))
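Note 2 above says the actual lines can be recovered by reading the files again. A minimal sketch of that second pass is below; the helper name fetch_lines is hypothetical, and it assumes line numbers start at 1 and that the default encoding can decode the file, as in the script above.

```python
def fetch_lines(path, wanted_numbers, first_line=1):
    """Return {line number: line text} for the requested line numbers."""
    wanted = set(wanted_numbers)
    found = {}
    with open(path, "r") as handle:
        for number, line in enumerate(handle, start=first_line):
            if number in wanted:
                found[number] = line.rstrip("\n")
                # stop early once every requested line has been collected
                if len(found) == len(wanted):
                    break
    return found
```

To use it with the script's output, you would parse the number after the colon in each "filename: line number" entry and pass the numbers for one file at a time.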