Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a more efficient way to reconcile large data sets?

I've been tasked with reconciling two big data sets (two big lists of transactions). Basically i extract the relevant fields from the two data sources into two files of the same format, then compare the files to find any records that are in A but not in B, or vice versa, and report on them. I wrote a blog entry on my best efforts achieving this (click if interested).

The gist of it is to load both data sets into a big hash table, with the keys being the rows, and the values being +1 each time it appears in file A, and -1 each time it appears in file B. Then at the end, i look for any key/value pairs where the value != 0.

My algorithm seems fast enough (10 seconds for 2*100mb files), however its a bit memory-intensive: 280mb to compare two sets of 100mb files, i would hope to get it down to 100mb peak memory usage, and possibly lower if the two data sets are sorted in roughly the same order.

Any ideas?

Also, let me know if this is too open ended for SO.

like image 234
Chris Avatar asked Oct 15 '22 15:10

Chris


2 Answers

I have done something similar to this only in scripts on unix using shell and perl, however the theory may cary over.

Step 1, sort both files so they are in order by the same criteria. I used the unix sort command to do this (i required the unique flag, but you just need some sort of memory efficient file sort). This is likely the tricky part to figure out on you're own.

Step 2, open both files, and essentially scan them line by line (or record by record if binary format). If the line in the left file is equal to the one in the right file, then the lines match and move on (remember we already sorted the file, so the smallest record should be first).

If left record is greater than right record, you're right record is missing, add it to you're list, and read the next line on the right file. And simply do you're check again. Same thing applies if you're right record is greater, than you left record is missing, report it and keep going.

The scanning the records should be very memory efficient. It may not be as fast, but for me I was able to crunch several gigs of data with multiple passes looking at different fields witihn a couple minutes.

like image 104
Kevin Nisbet Avatar answered Oct 18 '22 13:10

Kevin Nisbet


The only way I can think of is to not load all of the data into memory at once. If you change the way you process it so that it grabs a bit of each file at a time it would reduce your memory foot print but increase your disk IO which would probably result in a longer processing time.

like image 34
mezoid Avatar answered Oct 18 '22 14:10

mezoid