I'm trying to compute difference between two large csv files (~ 4GB) to obtain newly added rows and writing these into an output csv file. I'm able to obtain this functionality for relatively small files (~50MB) by using the following code.
input_file1 = "data.csv"
input_file2 = "data_1.csv"
output_path = "out.csv"
with open(input_file1, 'r') as t1, open(input_file2, 'r') as t2:
fileone = t1.readlines()
filetwo = t2.readlines()
with open(output_path, 'w') as outFile:
for line in filetwo:
if line not in fileone:
outFile.write(line)
However, for larger files, the above code is either too slow (runs for about half an hour) or crashes with lack of memory space.
Is there a faster way to obtain the difference for large csv files?
You don't have to read the second file fully, just read line by line.
And for the speed, just make a set
out of the first file (fast lookup, and saves memory if there are duplicate lines). For this you have to keep the second file open when you're writing the result:
input_file1 = "data.csv"
input_file2 = "data_1.csv"
output_path = "out.csv"
with open(input_file1, 'r') as t1:
fileone = set(t1)
with open(input_file2, 'r') as t2, open(output_path, 'w') as outFile:
for line in t2:
if line not in fileone:
outFile.write(line)
for line in t2
reads the file line by line (always avoid readlines()
if you can) so even if the file is big, the memory footprint is small.fileone
takes some memory, true, but hopefully if it's smaller and/or has duplicate lines, not so much, and of course less than with readlines()
if line not in fileone
may look the same as before, but it has an average O(1)
complexity, which makes the program much fasterYou could use Data base or alternatively a Sort Merge. I will give you the basic algorithm (rather than python Code)
The idea is to sort the 2 files into the same order. Then read sequentially through the 2 files
Sort the 2 files to new SortedFiles using the Operating Systems sort
(use the whole record as sort key)
Open/Read SortedOldFile
Open/Read SortedNewFile
while (not end-of-file-SortedOldFile) and (not end-of-file-SortedOldFile):
if SortedOldFile.record < SortedNewFile.record:
## Deleted processing goes here
read SortedOldFile
elseif SortedOldFile.record > SortedNewFile.record:
## Insert processing goes here
read SortedNewFile
else
read SortedOldFile
read SortedNewFile
while (not end-of-file-SortedOldFile):
## Deleted processing
read SortedOldFile
while (not end-of-file-SortedNewFile):
## Insert processing
read SortedNewFile
Advantages:
Disadvantages:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With