I have a list of 120 million records of around 40-50 bytes each, which is about 5.5-6 gigabytes of raw data, not including any extra overhead required to keep it in an array in memory.
I'd like to make sure this list is unique. The way I have tried to do it is to create a HashSet<string> and add all the entries to it one by one.
When I get to about 33 million records, I run out of memory and the list creation slows to a crawl.
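For reference, the approach described above is essentially this (a minimal sketch; the file name records.txt and the one-record-per-line format are assumptions on my part):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class NaiveUniquenessCheck
{
    static void Main()
    {
        // Assumes one record per line in "records.txt" (hypothetical file name).
        var seen = new HashSet<string>();
        long duplicates = 0;

        foreach (string line in File.ReadLines("records.txt"))
        {
            // Add returns false when the entry is already in the set.
            if (!seen.Add(line))
                duplicates++;
        }

        Console.WriteLine($"Duplicates: {duplicates}");
    }
}
```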
Is there a better way to sort this massive list of entries in a timely manner? The only solution I can think of is using an Amazon EC2 High-Memory Quadruple Extra Large Instance for an hour.
Thanks
Apply a filter to your data and click the filter arrow to see a list showing all the unique values within that particular column of data. A Pivot Table is another good way to list out unique values.
Too big for RAM, too small for a cluster. Small datasets are cool. You can load them into memory and manipulate them at will, no sweat. Massive datasets are also cool. They have lots of data and the promise of exciting models and analyses. You gladly pay the price of the required cluster just to handle all that goodness.
There are some techniques that you can use to handle big data that don't require spending any money or dealing with long loading times. This article will cover three techniques that you can implement using Pandas to deal with large datasets. The first technique we will cover is compressing the data.
Much of the time you can actually process data that's too big for RAM on a single machine, using a set of techniques that are sometimes called "out-of-core computation". The key topics are why you need RAM at all, the easiest way to process data that doesn't fit in memory (spending some money), and the three basic software techniques for handling too much data: compression, chunking, and indexing.
If you're just trying to check for uniqueness, I would simply split the input sequence into buckets, and then check each bucket separately.
For example, assuming you're loading the data from a file, you could stream the input in and write it out to 26 different files, one for each letter that a record starts with (I'm naively assuming each record starts with A-Z - please adjust for your real situation). Then you can check each of those smaller files for uniqueness using something like your existing code - because none of them will be too large to fit into memory at a time. The initial bucketing guarantees that there won't be any duplicate entries which are in different buckets.
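Here is a rough sketch of that two-pass idea in C# - the file names and the one-record-per-line format are assumptions, not part of the answer itself:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class FirstLetterBuckets
{
    static void Main()
    {
        // Pass 1: stream the input and append each record to a bucket file
        // keyed by its first letter (naively assuming records start with A-Z).
        var writers = new Dictionary<char, StreamWriter>();
        foreach (string line in File.ReadLines("records.txt"))
        {
            if (line.Length == 0) continue;          // skip blank lines
            char key = char.ToUpperInvariant(line[0]);

            if (!writers.TryGetValue(key, out StreamWriter writer))
            {
                writer = new StreamWriter($"bucket_{key}.txt");
                writers[key] = writer;
            }
            writer.WriteLine(line);
        }
        foreach (var w in writers.Values) w.Dispose();

        // Pass 2: each bucket should now be small enough to check in memory on its own.
        foreach (char key in writers.Keys)
        {
            var seen = new HashSet<string>();
            foreach (string line in File.ReadLines($"bucket_{key}.txt"))
            {
                if (!seen.Add(line))
                    Console.WriteLine($"Duplicate: {line}");
            }
        }
    }
}
```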
Of course, there are various different ways you could perform the bucketing, and different approaches will be effective for different data sets. You could bucket by hash code, for example - take the bottom 5 bits of the hash code to create 32 different buckets. That's likely to get a reasonably equal distribution of records between buckets, and doesn't make any assumptions about the input data. I only mentioned the "take the first letter" approach above as it's a simpler way of grasping the concept :)
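For the hash-code variant, only the way the bucket key is derived changes; a sketch of that key derivation, slotting into the first pass above:

```csharp
// Bottom 5 bits of the hash code -> 32 buckets, with no assumptions
// about what characters the records start with.
// Note: on newer .NET, string.GetHashCode() can differ between runs,
// so derive and use the bucket key within a single run.
int bucket = line.GetHashCode() & 0x1F;      // always in 0..31
string bucketFile = $"bucket_{bucket}.txt";
```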
Use bucket sort to sort the list, flushing some of the contents of the buckets out to disk regularly to avoid running out of memory. Then load each flushed bucket in sequence and either use your HashSet approach or sort it and check it that way.
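A sketch of the check-by-sorting step for one flushed bucket, assuming bucket files like the ones produced above; sorting puts duplicates next to each other, so a single pass over adjacent entries finds them:

```csharp
using System;
using System.IO;

class SortAndScanBuckets
{
    // Checks one flushed bucket for duplicates by sorting it and
    // comparing adjacent entries (bucket file names are assumptions).
    static void CheckBucket(string path)
    {
        string[] records = File.ReadAllLines(path);
        Array.Sort(records, StringComparer.Ordinal);

        for (int i = 1; i < records.Length; i++)
        {
            if (records[i] == records[i - 1])
                Console.WriteLine($"Duplicate: {records[i]}");
        }
    }

    static void Main()
    {
        // Load each flushed bucket in sequence; only one bucket is in memory at a time.
        foreach (string path in Directory.GetFiles(".", "bucket_*.txt"))
            CheckBucket(path);
    }
}
```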