I've got an interesting problem. I want to intersect two Sets of Longs in Java, with each set having 1B members - this is 4GB per set. This will not fit into memory on the server that I need to run it on.
I'm wondering what interesting methods are out there for solving this problem.
What I've come up with so far is reading subsets of each original set from disk that are small enough to fit into memory, then intersecting each subset, and writing those to disk temporarily. Finally, I could go through and intersect these subsets. I get a feeling that this may turn into a map reduce job.
Maybe you'll have some better ideas :) I doubt I'm the first person to have come up with this problem.
Sort both sets A
and B
separately.
Take and remove first element from set A and first element from set B
If they are equal, add to the result set.
If item from one set is greater, take next item from the second set.
Go to 2 as long as you don't reach end of any set.
The advantage of this method is that you never keep more then 2 longs in memory (except sorting). Sorting can be done effectively on disk (merge sort).
Do a disk-based merge sort on both sets.
Once that's done you can simply go through the sorted files sequentially and record the intersections in a new file.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With