Faster way to compute difference between two csv files

Question

I'm trying to compute difference between two large csv files (~ 4GB) to obtain newly added rows and writing these into an output csv file. I'm able to obtain this functionality for relatively small files (~50MB) by using the following code.

input_file1 = "data.csv"
input_file2 = "data_1.csv"
output_path = "out.csv"

with open(input_file1, 'r') as t1, open(input_file2, 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

with open(output_path, 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)

However, for larger files, the above code is either too slow (runs for about half an hour) or crashes with lack of memory space.

Is there a faster way to obtain the difference for large csv files?

Jean-François Fabre · Accepted Answer

You don't have to read the second file fully, just read line by line.

And for the speed, just make a set out of the first file (fast lookup, and saves memory if there are duplicate lines). For this you have to keep the second file open when you're writing the result:

input_file1 = "data.csv"
input_file2 = "data_1.csv"
output_path = "out.csv"

with open(input_file1, 'r') as t1:
    fileone = set(t1)

with open(input_file2, 'r') as t2, open(output_path, 'w') as outFile:
    for line in t2:
        if line not in fileone:
            outFile.write(line)

for line in t2 reads the file line by line (always avoid readlines() if you can) so even if the file is big, the memory footprint is small.
fileone takes some memory, true, but hopefully if it's smaller and/or has duplicate lines, not so much, and of course less than with readlines()
if line not in fileone may look the same as before, but it has an average O(1) complexity, which makes the program much faster

Bruce Martin · Answer

You could use Data base or alternatively a Sort Merge. I will give you the basic algorithm (rather than python Code)

Sort merge Description

The idea is to sort the 2 files into the same order. Then read sequentially through the 2 files

if records in the 2 files are equal --> in both files
if old-file-record > new-file-record --> record has been inserted
if old-file-record < new-file-record --> record has been deleted

Sort merge Algorithm

Sort the 2 files to new SortedFiles using the Operating Systems sort 
(use the whole record as sort key)

Open/Read  SortedOldFile
Open/Read  SortedNewFile

while (not end-of-file-SortedOldFile) and (not end-of-file-SortedOldFile):
    if SortedOldFile.record < SortedNewFile.record:
       ## Deleted processing goes here
       read SortedOldFile
    elseif SortedOldFile.record > SortedNewFile.record:
       ## Insert processing  goes here
       read SortedNewFile
    else
       read SortedOldFile
       read SortedNewFile

while (not end-of-file-SortedOldFile):
   ## Deleted processing
   read SortedOldFile

while (not end-of-file-SortedNewFile):
   ## Insert processing
   read SortedNewFile

Advantages:

uses minimal memory
it scales to absolutely massive files
Should be quick enough, Operating System Sorts are very efficient

Disadvantages:

Uses extra disk space (disk space is cheap these days)
code is Operating System dependent

Faster way to compute difference between two csv files

Tags:

python

csv

iprof0214

2 Answers

Jean-François Fabre

Sort merge Description

Sort merge Algorithm

Bruce Martin

Recent Activity

Donate For Us

Faster way to compute difference between two csv files

Tags:

python

csv

iprof0214

2 Answers

Jean-François Fabre

Sort merge Description

Sort merge Algorithm

Bruce Martin

Related questions

Recent Activity

Donate For Us