Merging 2 very large text files, update each line, without using memory

Question

Say I've got 2 text files with around 2 million lines each (~50-80MB file size each). The structure of both files is the same:

Column1 Column2 Column3
...

Column 1 never changes, Column 2: the same value may not be in both files, and won't be in the same order for both files, Column3 is a number and will be different in every file.

I need to be able to merge them both into one file, matched by Column 2. If Column2 exists in both files, update Column3 by adding the values of Column3 from both files together.

If the files weren't so huge, I could easily do this in PHP by reading each line of both files into arrays and going from there, but doing so easily overloads the memory available.

Is there a way to do this with without loading each line into memory? I'm mostly familiar with PHP, but open to Python, Java or Shell scripts if they are not too complicated to understand.

Jimmy · Accepted Answer

I'd go with command line sort(1)to merge and sort the files . After that, it should be a simple script to compute the sums. I don't know PHP, so I'll give my example in python:

sort -k2 <file1> <file2> | python -c "
  import itertools,sys
  allLines = (x.strip().split(' ') for x in sys.stdin)
  groups = itertools.groupby(allLines, lambda x:x[1])
  for k,lines in groups:
      firstLine = iter(g).next()
      print firstLine[0], firstline[1], sum(int(x[2]) for x in lines)
"

Marc B · Answer

Ok, so if I'm reading this right, you'll have:

file1:

abc 12 34
abc 56 78
abc 90 12

file2:

abc 90 87  <-- common column 2
abc 12 67  <---common column 2
abc 23 1   <-- unique column 2

output should be:

abc 12 101
abc 90 99

If that's the case, then something like this (assuming they're .csv-formatted):

$f1 = fopen('file1.txt', 'rb');
$f2 = fopen('file2.txt', 'rb');
$fout = fopen('outputxt.');

$data = array();
while(1) {
    if (feof($line1) || feof($line2)) {
        break; // quit if we hit the end of either file
    }

    $line1 = fgetcsv($f1);
    if (isset($data[$line1[1]])) {
       // saw the col2 value earlier, so do the math for the output file:
       $col3 = $line1[2] + $data[$line1[1]];
       $output = array($line[0], $line1[1], $col3);
       fputcsv($fout, $output);
       unset($data[$line1[1]]);
    } else {
       $data[$line1[1]] = $line1; // cache the line, if the col2 value wasn't seen already
    }

    $line2 = fgetcsv($f2);
    if (isset($data[$line2[1]])) {
       $col3 = $data[$line2[1]] + $line2[2];
       $newdata = array($line2[0], $line2[1], $col3);
       fputcsv($fout, $newdata);
       unset($data[$line2[1]]); // remove line from cache
    } else {
       $data[$line2[1]] = $line2;
    }
}

fclose($f1);
fclose($f2);
fclose($fout);

This is going off the top of my head, not tested, probably won't work, YMMV, etc...

It'd simplify things immensely if you pre-sort the two input files, so that column2 is used as the sort key. That'd keep the cache size down, as you'd know if you'd seen a matched value already and when to dump the earlier cached data.

Michael Lorton · Answer

What may throwing you is that you are looking at two files. There's no need for that. To use Mark's excellent example: file1:

abc 12 34
abc 56 78
abc 90 12

file2:

abc 90 87  
abc 12 67  
abc 23 1

then

sort file1 file2 > file3

yields file3:

abc 12 34
abc 12 67  
abc 23 1
abc 56 78
abc 90 12
abc 90 87

Second week of CS-101 to reduce that down to its final form.

Merging 2 very large text files, update each line, without using memory

Tags:

python

merge

php

file-io

memory

Brucial

3 Answers

Jimmy

Marc B

Michael Lorton

Recent Activity

Donate For Us

Merging 2 very large text files, update each line, without using memory

Tags:

python

merge

php

file-io

memory

Brucial

3 Answers

Jimmy

Marc B

Michael Lorton

Related questions

Recent Activity

Donate For Us