Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hashset memory overhead

Tags:

c#

memory

hashset

In C# program I have two Queues of longs, 26M elements together and four HashSets of longs, 50M elements together. So my containers are storing 75M longs which gives 600MB of data. Memory usage of program is 3GB.

Why these containers need so much memory? What is memory complexity of HashSet? Even if all structures doubles theirs capacity, it will give 1.2GB, not 3GB.

EDIT: Yes, I didn't mean complexity. How much additional memory HashSet needs to store long? Simple binary heap doesn't need any additional memory. Is there any alternative for HashSet if I need to lower memory usage or I need to implement one myself?

like image 958
Ari Avatar asked Jul 20 '14 22:07

Ari


1 Answers

Overview

HashSet has 12 bytes of overhead per slot (which can contain an item or be empty). This overhead is 150% larger than the data size in the case of storing longs.

HashSet also holds empty slots for new data and the number of items in your example (~12.5 million items per HashSet) leads to about 66% higher memory usage just due to empty slots.

If you need O(1) confirmation of existence in the set then a HashSet is perhaps the best you can do. If you know something special about your data (e.g. that it contains "runs" of hundreds of items in a row) then you might be able to come up with a more clever way to represent this that requires less memory. Without knowing more about the data it's hard to make suggestions about that.

Test Program

    static void Main(string[] args)
    {
        var q = new Queue<long>();
        var hs = new []
        {
            new HashSet<long>(),
            new HashSet<long>(),
            new HashSet<long>(),
            new HashSet<long>()
        };

        for (long i = 0; i < 25000000; ++i)
        {
            q.Enqueue(i);

            if (i < 12500000)
            {
                foreach (var h in hs)
                {
                    h.Add(i);
                }
            }
        }

        Console.WriteLine("Press [enter] to exit");
        Console.ReadLine();
    }

HashSet Implementation - Mono

Slot Allocation Strategy - Doubles the size of the table on each allocation.

https://github.com/mono/mono/blob/master/mcs/class/System.Core/System.Collections.Generic/HashSet.cs

HashSet Implementation - MSFT

Slot Allocation Strategy - Allocates using primes. This can lead to substantial amounts of empty space, but reduces the number of times that the table must be reallocated and rehashed.

http://referencesource.microsoft.com/#System.Core/System/Collections/Generic/HashSet.cs

Memory Usage - General Sizing - Mono Implementation

  • Initial Size: 10 Slots
  • Fill Factor: 90% Full Triggers Resize
  • Resize Factor: Doubles size when fill factor reached

Memory Usage - Per Slot - Both Implementations

  • Table: 1 x 4 byte int per slot = 4 bytes / slot
  • Links: 2 x 4 byte ints per slot = 8 bytes / slot
  • Slots: 1 x (sizeof T) bytes = 8 bytes / slot (for T = long)
  • Total: 20 bytes / slot

Slots Used in Example - Mono

The example has 12.5 million items in each HashSet.

slots = 10 * 2 ^ ceiling(log2(items / 10))

log2(12,500,000 /10) ~= 20.5

slots ~= 21 million

Memory Used in Example - Computed - Mono

Queue: 25 million longs * 8 bytes / long = 200 MB

Each HashSet: 21 million slots * 20 bytes / slot = 420 MB

All HashSets: 1.68 GB

Total: 1.88 GB (+ empty space in the Large Object Heaps)

Memory Used in Example - Observed with Son of Strike - MSFT Implementation

3.5 GB memory in .Net heaps

400 MB of Int32 arrays (used by HashSet, not for our data storage)

2.5 GB of HashSet Slot objects

Note: MSFT's Slot object is 8 bytes plus the size of the data (8 bytes in this case), for 16 bytes total. 2.5 GB of Slot objects is 156 million Slots, for storing only 50 million items.

dumpheap -stat

!dumpheap -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ffb549af228        1           24 System.Collections.Generic.GenericEqualityComparer`1[[System.Int64, mscorlib]]
[snip]
00007ffb53e80bd8      159         6926 System.String
00007ffb53e81250       27        36360 System.Object[]
00000042ed0a8a30       22     48276686      Free
00007ffb53f066f0        3    402653256 System.Int64[]
00007ffb53e83768       14    431963036 System.Int32[]
00007ffaf5e17e88        5   2591773968 System.Collections.Generic.HashSet`1+Slot[[System.Int64, mscorlib]][]
Total 343 objects

eeheap -gc

!eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x00000042800472f8
generation 1 starts at 0x0000004280001018
generation 2 starts at 0x0000004280001000
ephemeral segment allocation context: none
 segment     begin allocated  size
0000004280000000  0000004280001000  000000428004b310  0x4a310(303888)
Large object heap starts at 0x0000004290001000
 segment     begin allocated  size
0000004290000000  0000004290001000  0000004290009728  0x8728(34600)
00000042dc000000  00000042dc001000  00000042e7717e70  0xb716e70(191983216)
000000433e6e0000  000000433e6e1000  000000434f9835b0  0x112a25b0(287974832)
00000043526e0000  00000043526e1000  000000435a6e1038  0x8000038(134217784)
000000435e6e0000  000000435e6e1000  0000004380c25c00  0x22544c00(575949824)
00000043826e0000  00000043826e1000  000000438826c788  0x5b8b788(95991688)
000000438a6e0000  000000438a6e1000  00000043acc25c00  0x22544c00(575949824)
00000043ae6e0000  00000043ae6e1000  00000043b426c788  0x5b8b788(95991688)
00000043b66e0000  00000043b66e1000  00000043d8c25c00  0x22544c00(575949824)
00000043da6e0000  00000043da6e1000  00000043e026c788  0x5b8b788(95991688)
00000043e26e0000  00000043e26e1000  0000004404c25c00  0x22544c00(575949824)
0000004298000000  0000004298001000  00000042a8001038  0x10000038(268435512)
Total Size:              Size: 0xcf1c1560 (3474724192) bytes.
------------------------------
GC Heap Size:            Size: 0xcf1c1560 (3474724192) bytes.
like image 166
huntharo Avatar answered Sep 25 '22 22:09

huntharo