Faster way to compute difference between two csv files

Tags: python, csv

I'm trying to compute the difference between two large CSV files (~4 GB) to obtain the newly added rows and write them to an output CSV file. I can get this working for relatively small files (~50 MB) with the following code.

input_file1 = "data.csv"
input_file2 = "data_1.csv"
output_path = "out.csv"

with open(input_file1, 'r') as t1, open(input_file2, 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

with open(output_path, 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)

However, for larger files, the above code is either far too slow (it runs for about half an hour) or crashes because it runs out of memory.

Is there a faster way to obtain the difference for large csv files?

asked Mar 07 '23 by iprof0214

2 Answers

You don't have to read the second file fully; just read it line by line.

And for speed, just build a set from the first file (fast lookup, and it saves memory if there are duplicate lines). For this to work, keep the second file open while you write the result:

input_file1 = "data.csv"
input_file2 = "data_1.csv"
output_path = "out.csv"

with open(input_file1, 'r') as t1:
    fileone = set(t1)

with open(input_file2, 'r') as t2, open(output_path, 'w') as outFile:
    for line in t2:
        if line not in fileone:
            outFile.write(line)
  • for line in t2 reads the file line by line (avoid readlines() whenever you can), so even for a big file the memory footprint stays small.
  • fileone does take some memory, true, but hopefully, if it's smaller and/or has duplicate lines, not too much, and certainly less than with readlines().
  • if line not in fileone may look the same as before, but set membership has average O(1) complexity, which makes the program much faster.
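
If the first file itself is too big to hold as full text in a set, one possible variation (not part of the answer above, and it accepts a theoretical risk of hash collisions) is to store a fixed-size digest of each line instead of the line itself. A minimal sketch using hashlib, with the same file names as the question:

import hashlib

input_file1 = "data.csv"
input_file2 = "data_1.csv"
output_path = "out.csv"

# Keep only a 16-byte MD5 digest per line instead of the full line text.
with open(input_file1, 'rb') as t1:
    seen = {hashlib.md5(line).digest() for line in t1}

with open(input_file2, 'rb') as t2, open(output_path, 'wb') as outFile:
    for line in t2:
        # Write a line only if its digest was never seen in the first file.
        if hashlib.md5(line).digest() not in seen:
            outFile.write(line)

This keeps the per-line memory cost constant regardless of how long the rows are, at the price of hashing every line.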
answered Mar 18 '23 by Jean-François Fabre


You could use a database or, alternatively, a sort merge. I will give you the basic algorithm (rather than Python code).

Sort merge Description

The idea is to sort the 2 files into the same order, then read sequentially through both files in parallel:

  • if records in the 2 files are equal --> in both files
  • if old-file-record > new-file-record --> record has been inserted
  • if old-file-record < new-file-record --> record has been deleted

Sort merge Algorithm

Sort the 2 files to new SortedFiles using the Operating Systems sort 
(use the whole record as sort key)

Open/Read  SortedOldFile
Open/Read  SortedNewFile

while (not end-of-file-SortedOldFile) and (not end-of-file-SortedNewFile):
    if SortedOldFile.record < SortedNewFile.record:
       ## Deleted processing goes here
       read SortedOldFile
    elseif SortedOldFile.record > SortedNewFile.record:
       ## Insert processing  goes here
       read SortedNewFile
    else
       read SortedOldFile
       read SortedNewFile

while (not end-of-file-SortedOldFile):
   ## Deleted processing
   read SortedOldFile

while (not end-of-file-SortedNewFile):
   ## Insert processing
   read SortedNewFile
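
For concreteness, here is a minimal Python sketch of the merge step above. It is not from the original answer; the sorted file names are hypothetical, and both inputs are assumed to have been pre-sorted already, e.g. with the OS sort utility (sort data.csv -o data_sorted.csv):

old_sorted = "data_sorted.csv"    # sorted version of the old file (hypothetical name)
new_sorted = "data_1_sorted.csv"  # sorted version of the new file (hypothetical name)
output_path = "out.csv"

with open(old_sorted, 'r') as old_f, \
     open(new_sorted, 'r') as new_f, \
     open(output_path, 'w') as out_f:
    old_line = old_f.readline()
    new_line = new_f.readline()
    # Walk both sorted files in lock-step; readline() returns '' at end of file.
    while old_line and new_line:
        if old_line < new_line:
            # Only in the old file -> deleted row (ignored here)
            old_line = old_f.readline()
        elif old_line > new_line:
            # Only in the new file -> newly added row
            out_f.write(new_line)
            new_line = new_f.readline()
        else:
            # In both files -> unchanged
            old_line = old_f.readline()
            new_line = new_f.readline()
    # Whatever is left over in the new file is also newly added.
    while new_line:
        out_f.write(new_line)
        new_line = new_f.readline()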

Advantages:

  • uses minimal memory
  • it scales to absolutely massive files
  • Should be quick enough; Operating System sorts are very efficient

Disadvantages:

  • Uses extra disk space (disk space is cheap these days)
  • code is Operating System dependent
answered Mar 18 '23 by Bruce Martin