 

Creating a unique list from dataset too big to fit in memory

Tags:

c#

.net

hashset

I have a list of 120 million records of around 40-50 bytes each, which is about 5.5-6 gigabytes of raw data, not including any extra storage required to keep an array in memory.

I'd like to make sure this list is unique. The way I have tried to do it is to create a HashSet<string> and add all the entries to it one by one.
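In essence the approach looks something like this (a simplified sketch: the real data source is assumed here to be a text file, records.txt, with one record per line):

```csharp
// Simplified sketch of the HashSet approach described above.
// "records.txt" is a placeholder for the real input, one record per line.
using System;
using System.Collections.Generic;
using System.IO;

class UniqueCheck
{
    static void Main()
    {
        var seen = new HashSet<string>();
        bool allUnique = true;

        foreach (string record in File.ReadLines("records.txt"))
        {
            // Add returns false if the record was already in the set.
            if (!seen.Add(record))
            {
                allUnique = false;
                Console.WriteLine("Duplicate: " + record);
            }
        }

        Console.WriteLine(allUnique ? "All records unique." : "Duplicates found.");
    }
}
```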

When I get to about 33 million records, I run out of memory and the list creation slows to a crawl.

Is there a better way to sort this massive list of entries in a timely manner? The only solution I can think of is using an Amazon EC2 High-Memory Quadruple Extra Large Instance for an hour.

Thanks

asked Jan 05 '11 by gary



2 Answers

If you're just trying to check for uniqueness, I would simply split the input sequence into buckets, and then check each bucket separately.

For example, assuming you're loading the data from a file, you could stream the input in and write it out to 26 different files, one for each letter the record starts with (I'm naively assuming each record starts with A-Z - please adjust for your real situation). Then you can check each of those smaller files for uniqueness using something like your existing code - because none of them will be too large to fit into memory on its own. The initial bucketing guarantees that there won't be any duplicate entries which are in different buckets. A sketch of this two-pass idea follows below.
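A rough sketch of that two-pass approach, assuming one record per line in a text file and that every record starts with A-Z (file names are illustrative):

```csharp
// Pass 1: stream the input and split it into per-letter bucket files.
// Pass 2: check each bucket file on its own with an in-memory HashSet.
using System;
using System.Collections.Generic;
using System.IO;

class BucketUniquenessCheck
{
    static void Main()
    {
        const string inputPath = "records.txt";   // placeholder input file

        // Pass 1: append each record to the bucket file for its first letter.
        var writers = new Dictionary<char, StreamWriter>();
        try
        {
            foreach (string record in File.ReadLines(inputPath))
            {
                char bucket = record[0];   // naive: first letter picks the bucket
                StreamWriter writer;
                if (!writers.TryGetValue(bucket, out writer))
                {
                    writer = new StreamWriter("bucket_" + bucket + ".txt");
                    writers[bucket] = writer;
                }
                writer.WriteLine(record);
            }
        }
        finally
        {
            foreach (var writer in writers.Values) writer.Dispose();
        }

        // Pass 2: each bucket is small enough to check in memory.
        foreach (char bucket in writers.Keys)
        {
            var seen = new HashSet<string>();
            foreach (string record in File.ReadLines("bucket_" + bucket + ".txt"))
            {
                if (!seen.Add(record))
                    Console.WriteLine("Duplicate in bucket " + bucket + ": " + record);
            }
        }
    }
}
```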

Of course, there are various different ways you could perform the bucketing, and different approaches will be effective for different data sets. You could bucket by hash code, for example - take the bottom 5 bits of the hash code to create 32 different buckets. That's likely to get a reasonably equal distribution of records between buckets, and doesn't make any assumptions about the input data. I only mentioned the "take the first letter" approach above as it's a simpler way of grasping the concept :)
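The hash-based variant only changes how the bucket key is chosen; something along these lines would replace the first-letter selection in the sketch above:

```csharp
// Drop-in replacement for the bucket selection: use the bottom 5 bits of the
// string's hash code to pick one of 32 buckets. Within a single process run,
// equal strings hash equally, so duplicates always land in the same bucket;
// don't rely on the value being stable across runs or runtimes.
static int BucketFor(string record)
{
    return record.GetHashCode() & 0x1F;   // 0x1F masks the bottom 5 bits => 0..31
}
```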

answered Nov 15 '22 by Jon Skeet

Use bucket sort to sort the list, flushing some of the contents of the buckets out to disk regularly to avoid running out of memory. Then load each flushed bucket in sequence and either use your HashSet approach or sort it and check it that way.
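For the "sort it and check it" option, once a flushed bucket fits in memory the duplicate check is just a scan of adjacent entries after sorting; a sketch for one bucket file (the file name is illustrative):

```csharp
// Sketch of the "sort the bucket, then compare neighbours" check for a single
// flushed bucket file that is small enough to load in memory on its own.
using System;
using System.IO;

class SortedBucketCheck
{
    static void Main()
    {
        string[] records = File.ReadAllLines("bucket_0.txt"); // placeholder bucket file
        Array.Sort(records, StringComparer.Ordinal);

        for (int i = 1; i < records.Length; i++)
        {
            // After sorting, any duplicates sit next to each other.
            if (records[i] == records[i - 1])
                Console.WriteLine("Duplicate: " + records[i]);
        }
    }
}
```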

answered Nov 15 '22 by Amber