Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to intersect two sets of longs that don't fit into memory?

I've got an interesting problem. I want to intersect two Sets of Longs in Java, with each set having 1B members - this is 4GB per set. This will not fit into memory on the server that I need to run it on.

I'm wondering what interesting methods are out there for solving this problem.

What I've come up with so far is reading subsets of each original set from disk that are small enough to fit into memory, then intersecting each subset, and writing those to disk temporarily. Finally, I could go through and intersect these subsets. I get a feeling that this may turn into a map reduce job.

Maybe you'll have some better ideas :) I doubt I'm the first person to have come up with this problem.

like image 314
cjroth Avatar asked Oct 20 '12 18:10

cjroth


2 Answers

  1. Sort both sets A and B separately.

  2. Take and remove first element from set A and first element from set B

  3. If they are equal, add to the result set.

  4. If item from one set is greater, take next item from the second set.

  5. Go to 2 as long as you don't reach end of any set.

The advantage of this method is that you never keep more then 2 longs in memory (except sorting). Sorting can be done effectively on disk (merge sort).

like image 105
Tomasz Nurkiewicz Avatar answered Sep 22 '22 16:09

Tomasz Nurkiewicz


Do a disk-based merge sort on both sets.

Once that's done you can simply go through the sorted files sequentially and record the intersections in a new file.

like image 40
Brian Roach Avatar answered Sep 20 '22 16:09

Brian Roach