Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging 2 very large text files, update each line, without using memory

Say I've got 2 text files with around 2 million lines each (~50-80MB file size each). The structure of both files is the same:

Column1 Column2 Column3
...

Column 1 never changes, Column 2: the same value may not be in both files, and won't be in the same order for both files, Column3 is a number and will be different in every file.

I need to be able to merge them both into one file, matched by Column 2. If Column2 exists in both files, update Column3 by adding the values of Column3 from both files together.

If the files weren't so huge, I could easily do this in PHP by reading each line of both files into arrays and going from there, but doing so easily overloads the memory available.

Is there a way to do this with without loading each line into memory? I'm mostly familiar with PHP, but open to Python, Java or Shell scripts if they are not too complicated to understand.

like image 556
Brucial Avatar asked Aug 29 '11 22:08

Brucial


3 Answers

I'd go with command line sort(1)to merge and sort the files . After that, it should be a simple script to compute the sums. I don't know PHP, so I'll give my example in python:

sort -k2 <file1> <file2> | python -c "
  import itertools,sys
  allLines = (x.strip().split(' ') for x in sys.stdin)
  groups = itertools.groupby(allLines, lambda x:x[1])
  for k,lines in groups:
      firstLine = iter(g).next()
      print firstLine[0], firstline[1], sum(int(x[2]) for x in lines)
"
like image 102
Jimmy Avatar answered Oct 17 '22 13:10

Jimmy


Ok, so if I'm reading this right, you'll have:

file1:

abc 12 34
abc 56 78
abc 90 12

file2:

abc 90 87  <-- common column 2
abc 12 67  <---common column 2
abc 23 1   <-- unique column 2

output should be:

abc 12 101
abc 90 99

If that's the case, then something like this (assuming they're .csv-formatted):

$f1 = fopen('file1.txt', 'rb');
$f2 = fopen('file2.txt', 'rb');
$fout = fopen('outputxt.');

$data = array();
while(1) {
    if (feof($line1) || feof($line2)) {
        break; // quit if we hit the end of either file
    }

    $line1 = fgetcsv($f1);
    if (isset($data[$line1[1]])) {
       // saw the col2 value earlier, so do the math for the output file:
       $col3 = $line1[2] + $data[$line1[1]];
       $output = array($line[0], $line1[1], $col3);
       fputcsv($fout, $output);
       unset($data[$line1[1]]);
    } else {
       $data[$line1[1]] = $line1; // cache the line, if the col2 value wasn't seen already
    }

    $line2 = fgetcsv($f2);
    if (isset($data[$line2[1]])) {
       $col3 = $data[$line2[1]] + $line2[2];
       $newdata = array($line2[0], $line2[1], $col3);
       fputcsv($fout, $newdata);
       unset($data[$line2[1]]); // remove line from cache
    } else {
       $data[$line2[1]] = $line2;
    }
}

fclose($f1);
fclose($f2);
fclose($fout);

This is going off the top of my head, not tested, probably won't work, YMMV, etc...

It'd simplify things immensely if you pre-sort the two input files, so that column2 is used as the sort key. That'd keep the cache size down, as you'd know if you'd seen a matched value already and when to dump the earlier cached data.

like image 1
Marc B Avatar answered Oct 17 '22 15:10

Marc B


What may throwing you is that you are looking at two files. There's no need for that. To use Mark's excellent example: file1:

abc 12 34
abc 56 78
abc 90 12

file2:

abc 90 87  
abc 12 67  
abc 23 1  

then

sort file1 file2 > file3

yields file3:

abc 12 34
abc 12 67  
abc 23 1
abc 56 78
abc 90 12
abc 90 87  

Second week of CS-101 to reduce that down to its final form.

like image 1
Michael Lorton Avatar answered Oct 17 '22 13:10

Michael Lorton