
How to filter a very, very large file

I have a very large unsorted file (1000 GB) of ID pairs:

  1. ID:ABC123 ID:ABC124
  2. ID:ABC123 ID:ABC124
  3. ID:ABC123 ID:ABA122
  4. ID:ABC124 ID:ABC123
  5. ID:ABC124 ID:ABC126

I would like to filter the file for

1) duplicates

example
ABC123 ABC124
ABC123 ABC124

2) reverse pairs (discard the second occurrence)

example
ABC123 ABC124
ABC124 ABC123

After filtering, the example file above would look like

  1. ID:ABC123 ID:ABC124
  2. ID:ABC123 ID:ABA122
  3. ID:ABC124 ID:ABC126

Currently, my solution is this

my %hash;

while (my $line = <FH>) {             # FH is assumed to be an already-open filehandle
    chomp $line;                      # remove the trailing newline
    my ($id1, $id2) = split / /, $line;
    if (exists $hash{"$id1 $id2"} || exists $hash{"$id2 $id1"}) {
        next;                         # this pair (or its reverse) was already printed
    }
    else {
        $hash{"$id1 $id2"} = undef;   # remember the pair
        print "$line\n";
    }
}

which gives me the desired results for smaller lists, but takes up too much memory for larger lists, as I am storing the hash in memory.

I am looking for a solution that will take less memory to implement. Some thoughts I have are

1) save the hash to a file, instead of memory

2) multiple passes over the file

3) sorting and uniquing the file with unix sort -u -k1,2

After I posted on the Computer Science Stack Exchange, an external sort algorithm was suggested.
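
As a rough sketch of idea 3 combined with that suggestion (the filenames pairs.txt, canonical.txt and unique_pairs.txt are placeholders): a plain sort -u on the raw file would not merge reversed pairs, so one linear pass first rewrites each pair with the smaller ID in front, and the external sort then removes both kinds of duplicates on disk.

# Canonicalization pass: rewrite every pair with the lexicographically
# smaller ID first, so a pair and its reverse become the same line.
with open("pairs.txt") as infile, open("canonical.txt", "w") as outfile:
    for line in infile:
        id1, id2 = line.split()
        if id2 < id1:
            id1, id2 = id2, id1
        outfile.write(f"{id1} {id2}\n")

# Then, on the shell, let the external sort do the heavy lifting:
#   sort -u canonical.txt > unique_pairs.txt
# GNU sort spills to temporary files, so memory use stays bounded.

This does change the order of the IDs within a pair; if the original orientation matters, the original line could be carried along as an extra field.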

asked May 09 '14 by bio-boris

2 Answers

You could use map reduce for the tasks.

Map-Reduce is a framework for batch processing that lets you easily distribute your work across several machines and use parallel processing without having to handle synchronization and fault tolerance yourself.

map(id1, id2):
    if id1 < id2:
        yield (id1, id2)
    else:
        yield (id2, id1)

reduce(id1, list<ids>):
    ids = hashset(ids)   // fairly small per id1
    for each id2 in ids:
        yield (id1, id2)

The map-reduce implementation will let you distribute your work across several machines with very little extra programming work required.
The algorithm also requires only a linear (and fairly small) number of traversals over the data and a fairly small amount of extra memory, assuming each ID is associated with a small number of other IDs.
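
To make the flow concrete, here is a small in-memory illustration (plain Python, using the example IDs from the question; a toy to show the logic, not a distributed run): the map step canonicalizes each pair, grouping by key plays the role of the shuffle, and the reduce step emits each distinct pair once.

from collections import defaultdict

# Toy data: the pairs from the question, without the "ID:" prefix.
pairs = [
    ("ABC123", "ABC124"),
    ("ABC123", "ABC124"),
    ("ABC123", "ABA122"),
    ("ABC124", "ABC123"),
    ("ABC124", "ABC126"),
]

# Map phase: put the smaller ID first, so a pair and its reverse share a key.
mapped = [(min(a, b), max(a, b)) for a, b in pairs]

# Shuffle phase: group all second IDs under their first ID.
groups = defaultdict(list)
for id1, id2 in mapped:
    groups[id1].append(id2)

# Reduce phase: a per-key set removes duplicates within each (small) group.
for id1, ids in groups.items():
    for id2 in sorted(set(ids)):
        print(id1, id2)
# Prints the three unique pairs; the IDs inside some pairs come out swapped,
# which is the point of the note below.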

Note that this will alter the order of IDs within a pair (the first ID becomes the second in some cases). If the original order matters, you can solve that fairly easily with an extra field.
Also note that the order of the lines is altered, and there is no way to avoid that when using map-reduce.

For better efficiency, you might want to add a combiner, which in this case would do the same job as the reducer, but whether it actually helps depends a lot on the data.

Hadoop is an open source library that implements Map-Reduce, and is widely used in the community.
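
One way to express this for Hadoop is a small mrjob job (mrjob is a Python wrapper around Hadoop Streaming; the class name, the filenames and the combiner below are illustrative choices, not something prescribed above):

from mrjob.job import MRJob


class DedupPairs(MRJob):
    def mapper(self, _, line):
        id1, id2 = line.split()
        if id2 < id1:              # canonical order: smaller ID first
            id1, id2 = id2, id1
        yield (id1, id2), None

    # Optional combiner: collapses repeats of a pair within one map task
    # before the shuffle (the optimisation mentioned above).
    def combiner(self, pair, values):
        yield pair, None

    def reducer(self, pair, values):
        id1, id2 = pair
        yield id1, id2


if __name__ == "__main__":
    DedupPairs.run()

Run locally with python dedup_pairs.py pairs.txt (mrjob's default output encoding is JSON-based), or submit to a cluster with -r hadoop.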

answered by amit


Depending on the details of your data (see my comment on the question), a Bloom filter may be a simple way to get away with two passes. In the first pass, insert every pair into the filter after ordering the first and second value, and build a set of possible duplicates. In the second pass, filter the file using the set of possible duplicates. This obviously requires that the set of (possible) duplicates is not itself large.
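
A minimal sketch of that two-pass idea (the toy Bloom filter, the sizing constants and the filenames are illustrative, not part of the approach above):

import hashlib


class BloomFilter:
    """Toy Bloom filter: k bit positions derived from blake2b digests."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def canonical(line):
    id1, id2 = line.split()
    return f"{id1} {id2}" if id1 <= id2 else f"{id2} {id1}"


# Pass 1: collect pairs that *might* have been seen before.
bloom = BloomFilter(num_bits=10**9, num_hashes=7)  # placeholder sizing
maybe_dupes = set()
with open("pairs.txt") as infile:
    for line in infile:
        key = canonical(line)
        if bloom.might_contain(key):
            maybe_dupes.add(key)   # must stay small for this approach to work
        else:
            bloom.add(key)

# Pass 2: keep the first occurrence of every pair; only the (small) set of
# possible duplicates needs exact tracking.
seen = set()
with open("pairs.txt") as infile, open("filtered.txt", "w") as outfile:
    for line in infile:
        key = canonical(line)
        if key in maybe_dupes:
            if key in seen:
                continue           # real duplicate or reversed duplicate
            seen.add(key)
        outfile.write(line)

The filter size and the number of hash functions have to be tuned to the data; as the next paragraph points out, for a data set of this size the filter itself becomes very large.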

Given the characteristics of the data set (up to around 25 billion unique pairs at roughly 64 bits per pair), the result will be on the order of 200 GB. So you either need a lot of memory, many passes or many machines. Even a Bloom filter will have to be huge to yield an acceptable error rate.

sortbenchmark.org can provide some hints on what is required, because the task is not too different from sorting. The 2011 winner used 66 nodes, each with two quad-core processors, 24 GiB of memory and sixteen 500 GB disks, and sorted 1,353 GB in 59.2 seconds.

answered by Daniel Brückner