Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What does a hashset do with memory when initializing a collection?

I stumbled upon the following problem.
I want a hashset with all numbers from 1 to 100.000.000. I tried the following code:

var mySet = new HashSet<int>();
for (var k = 1; k <= 100000000; k++)
     mySet.Add(k);

That code didn't make it since I got memory overflow somewhere around the 49mil. This was also pretty slow and memory grew excessively.

Then I tried this.

var mySet = Enumerable.Range(1, 100000000).ToHashSet();

where ToHashSet() is the following code:

public static HashSet<T> ToHashSet<T>(this IEnumerable<T> source)
{
    return new HashSet<T>(source);
}

I got a memory overflow again but I was able to put in more numbers then with the previous code.

The thing that does work is the following:

var tempList = new List<int>();
for (var k = 1; k <= 100000000; k++)
     tempList.Add(k);

var numbers = tempList.ToHashSet();

It takes about 800ms on my system to just fill the tempList where the Enumerable.Range() only takes 4 ticks!

I do need that HashSet or else it would take to much time to lookup values (I need it to be O(1)) and it would be great if I could do that the fastest way.

Now my question is:
Why do the first two methods cause a memory overflow where the third doesn't?

Is there something special HashSet does with the memory on initializing?

My system has 16GB memory so i was quite surprised when I got the overflow exceptions.

like image 547
Mixxiphoid Avatar asked Jul 19 '12 08:07

Mixxiphoid


1 Answers

Like other collection types, the HashSet will automatically increase its capacity as required as you add elements. When adding a large number of elements, this will result in a large number of reallocations.

If you initialize it with a constructor that takes an IEnumerable<T>, it will check if the IEnumerable<T> is in fact an ICollection<T>, and if so, initialize the HashSet's capacity to the size of the collection.

This is what's happening in you're third example - you're adding a List<T> which is also an ICollection<T>, so your HashSet is given an initial capacity equal to the size of the list, thus ensuring that no reallocations are needed.

You will be even more efficient if you use the List<T> constructor that takes a capacity parameter, as this will avoid reallocations when building the list:

var noElements = 100000000;
var tempList = new List<int>(noElements); 
for (var k = 1; k <= noElements; k++) 
     tempList.Add(k); 

var numbers = tempList.ToHashSet(); 

As for your system memory; check if this is a 32-bit or 64-bit process. A 32-bit process has a maximum of 2GB memory available (3GB if you've used the /3GB startup switch).

Unlike other collection types (e.g. List<T>, Dictionary<TKey,TValue>), HashSet<T> doesn't have a constructor that takes a capacity parameter to set the initial capacity. If you want to initialize a HashSet<T> with a large number of elements, the most efficient way to do so is probably to first add the elements to an array or List<T> with the appropriate capacity, then pass this array or list to the HashSet<T> constructor.

like image 65
Joe Avatar answered Oct 11 '22 16:10

Joe