I have a list of 120 million records of around 40-50 bytes each, which is about 5.5-6 gigabytes of raw data, not including any extra overhead required to keep it in an array in memory.
I'd like to make sure this list is unique. The way I have tried to do it is to create a HashSet<string> and add all the entries to it one by one.
When I get to about 33 million records, I run out of memory and the list creation slows to a crawl.
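For reference, the approach described above is essentially this (a minimal sketch; the file name records.txt and the one-record-per-line format are assumptions on my part):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class NaiveUniquenessCheck
{
    static void Main()
    {
        // Assumes one record per line in "records.txt" (hypothetical file name).
        var seen = new HashSet<string>();
        long duplicates = 0;

        foreach (string line in File.ReadLines("records.txt"))
        {
            // Add returns false when the entry is already in the set.
            if (!seen.Add(line))
                duplicates++;
        }

        Console.WriteLine($"Duplicates: {duplicates}");
    }
}
```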
Is there a better way to sort this massive list of entries in a timely manner? The only solution I can think of is using an Amazon EC2 High-Memory Quadruple Extra Large Instance for an hour.
Thanks
Apply a filter to your data and click the filter arrow to see a list showing all the unique values within that particular column of data. A Pivot Table is another good way to list out unique values.
Too big for RAM, too small for a cluster. Small datasets are cool. You can load them into memory and manipulate them at will, no sweat. Massive datasets are also cool. They have lots of data and the promise of exciting models and analyses. You gladly pay the price of the required cluster just to handle all that goodness.
There are some techniques that you can use to handle big data that don't require spending any money or dealing with long loading times. This article will cover three techniques that you can implement using Pandas to deal with large datasets. The first technique we will cover is compressing the data.
Much of the time you can actually process data that's too big for RAM on a single machine, using a set of techniques that are sometimes called "out-of-core computation". The key topics are why you need RAM at all, the easiest way to process data that doesn't fit in memory (spending some money), and the three basic software techniques for handling too much data: compression, chunking, and indexing.
If you're just trying to check for uniqueness, I would simply split the input sequence into buckets, and then check each bucket separately.
For example, assuming you're loading the data from a file, you could stream the input in and write it out to 26 different files, one for each letter that a record starts with (I'm naively assuming each record starts with A-Z - please adjust for your real situation). Then you can check each of those smaller files for uniqueness using something like your existing code - because none of them will be too large to fit into memory at a time. The initial bucketing guarantees that there won't be any duplicate entries which are in different buckets.
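Here is a rough sketch of that two-pass idea in C# - the file names and the one-record-per-line format are assumptions, not part of the answer itself:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class FirstLetterBuckets
{
    static void Main()
    {
        // Pass 1: stream the input and append each record to a bucket file
        // keyed by its first letter (naively assuming records start with A-Z).
        var writers = new Dictionary<char, StreamWriter>();
        foreach (string line in File.ReadLines("records.txt"))
        {
            if (line.Length == 0) continue;          // skip blank lines
            char key = char.ToUpperInvariant(line[0]);

            if (!writers.TryGetValue(key, out StreamWriter writer))
            {
                writer = new StreamWriter($"bucket_{key}.txt");
                writers[key] = writer;
            }
            writer.WriteLine(line);
        }
        foreach (var w in writers.Values) w.Dispose();

        // Pass 2: each bucket should now be small enough to check in memory on its own.
        foreach (char key in writers.Keys)
        {
            var seen = new HashSet<string>();
            foreach (string line in File.ReadLines($"bucket_{key}.txt"))
            {
                if (!seen.Add(line))
                    Console.WriteLine($"Duplicate: {line}");
            }
        }
    }
}
```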
Of course, there are various different ways you could perform the bucketing, and different approaches will be effective for different data sets. You could bucket by hash code, for example - take the bottom 5 bits of the hash code to create 32 different buckets. That's likely to get a reasonably equal distribution of records between buckets, and doesn't make any assumptions about the input data. I only mentioned the "take the first letter" approach above as it's a simpler way of grasping the concept :)
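For the hash-code variant, only the way the bucket key is derived changes; a sketch of that key derivation, slotting into the first pass above:

```csharp
// Bottom 5 bits of the hash code -> 32 buckets, with no assumptions
// about what characters the records start with.
// Note: on newer .NET, string.GetHashCode() can differ between runs,
// so derive and use the bucket key within a single run.
int bucket = line.GetHashCode() & 0x1F;      // always in 0..31
string bucketFile = $"bucket_{bucket}.txt";
```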
Use bucket sort to sort the list, flushing some of the contents of the buckets out to disk regularly to avoid running out of memory. Then load each flushed bucket in sequence and either use your HashSet approach or sort it and check it that way.
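A sketch of the check-by-sorting step for one flushed bucket, assuming bucket files like the ones produced above; sorting puts duplicates next to each other, so a single pass over adjacent entries finds them:

```csharp
using System;
using System.IO;

class SortAndScanBuckets
{
    // Checks one flushed bucket for duplicates by sorting it and
    // comparing adjacent entries (bucket file names are assumptions).
    static void CheckBucket(string path)
    {
        string[] records = File.ReadAllLines(path);
        Array.Sort(records, StringComparer.Ordinal);

        for (int i = 1; i < records.Length; i++)
        {
            if (records[i] == records[i - 1])
                Console.WriteLine($"Duplicate: {records[i]}");
        }
    }

    static void Main()
    {
        // Load each flushed bucket in sequence; only one bucket is in memory at a time.
        foreach (string path in Directory.GetFiles(".", "bucket_*.txt"))
            CheckBucket(path);
    }
}
```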