 

C# dictionary for a large number of items

Tags: c#, dictionary

I would like to understand the cost of storing a large number of items in memory in C#. The data structure I need is a dictionary or something similar. Let's say the number of items will be around 100 million, but the application will not reach that number immediately; it will take a long time to get anywhere near that limit.

I'm worried about operations whose cost is amortised but which I cannot afford to be too expensive at any given moment. Dynamic data structures usually reallocate themselves when they become full, and in the case of a dictionary I think every item even gets re-indexed. So suppose the application holds 20 million items, which just reaches the capacity of the dictionary. When new dictionary storage is allocated, those 20 million items need to be re-indexed.

This is why I was thinking that an array of dictionaries might be a good idea. Let's say I create 256 dictionaries. This immediately bounds the size of each internal dictionary to less than 1 million items, which should be manageable to build up dynamically, with all the re-indexing happening on the way up to 1 million items. The cost seems to be just one extra indexing step per operation to find the correct dictionary to look into.

Is this a reasonable approach? Is my analysis correct, or will the C# dictionary perform better than I think for some reason? Is there any other solution that would be better? I'm looking for a data structure with the same time complexity as the C# dictionary.

Edit: The dictionary key is a random value, so I can just take its first byte to find my index into the array of 256 dictionaries very cheaply.

I'm currently not considering a database, as I want all the items to be immediately available at very little cost. I need lookups in constant time with very little overhead. I can afford inserts to be slower, but still constant time; the same goes for deletes, which can be a little slower, but constant time is required.

It should be possible to fit all the items in memory. The items are small, about 50 bytes of data each, so the data structure must not add too much overhead per item.
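For concreteness, here is a minimal sketch of the sharding idea (the ShardedDictionary name and the ulong key type are just illustrative placeholders, not a fixed design):

using System.Collections.Generic;

// 256 shards, selected by the low byte of a (random) key.
class ShardedDictionary<TValue>
{
    private readonly Dictionary<ulong, TValue>[] _shards = new Dictionary<ulong, TValue>[256];

    public ShardedDictionary()
    {
        for (int i = 0; i < _shards.Length; i++)
            _shards[i] = new Dictionary<ulong, TValue>();
    }

    // The key is random, so its low byte is a cheap, uniform shard index.
    private Dictionary<ulong, TValue> Shard(ulong key) => _shards[(byte)key];

    public void Add(ulong key, TValue value) => Shard(key).Add(key, value);
    public bool TryGetValue(ulong key, out TValue value) => Shard(key).TryGetValue(key, out value);
    public bool Remove(ulong key) => Shard(key).Remove(key);
}

With 100 million items spread over 256 shards, each shard stays well under 1 million entries, so no single resize ever has to re-index more than that.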

asked Mar 06 '23 by Wapac


1 Answer

UPDATE: I've edited this since I posted it:

  • Store a fixed-size object (byte[50]) with each pass,
  • pre-allocate all these before adding to the dictionary (rather than creating objects within the loop)
  • run GC.Collect() after pre-allocating stuff.
  • gcAllowVeryLargeObjects set to true (see the config snippet after this list).
  • definitely set for x64 (it was before, but then I switched to 'Release' to build and run outside VS... and it reset, so oops.)
  • tried it with and without pre-allocating the dictionary size.
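For reference, that flag lives in the application's App.config; this is the .NET Framework form of the setting:

<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>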

Here is the code now:

var arrays = new byte[100000000][];
System.Diagnostics.Stopwatch stopwatch = new System.Diagnostics.Stopwatch();
stopwatch.Start();
for (var i = 0; i<100000000; i++)
{
    arrays[i] = new byte[50];
}
stopwatch.Stop();
Console.WriteLine($"initially allocating arrays took {stopwatch.ElapsedMilliseconds} ms");
stopwatch.Restart();

GC.Collect();
Console.WriteLine($"GC after array allocation took {stopwatch.ElapsedMilliseconds} ms");

// Toggle between these two lines to compare a pre-allocated capacity
// with the default grow-as-you-go behaviour.
Dictionary<int, byte[]> dict = new Dictionary<int, byte[]>(100000000);
//Dictionary<int, byte[]> dict = new Dictionary<int, byte[]>();

for (var c = 0; c < 100; c++)
{
    stopwatch.Restart();
    for (var i = 0; i < 1000000; i++)
    {
        // Use only pre-allocated payloads; nothing is allocated inside the timed loop.
        dict.Add((c * 1000000) + i, arrays[(c * 1000000) + i]);
    }
    stopwatch.Stop();
    Console.WriteLine($"pass number {c} took {stopwatch.ElapsedMilliseconds} milliseconds");
}

Console.ReadLine();

Here is the output when I DON'T pre-allocate the size of the dictionary:

initially allocating arrays took 14609 ms
GC after array allocation took 3713 ms
pass number 0 took 63 milliseconds
pass number 1 took 51 milliseconds
pass number 2 took 78 milliseconds
pass number 3 took 28 milliseconds
pass number 4 took 32 milliseconds
pass number 5 took 133 milliseconds
pass number 6 took 41 milliseconds
pass number 7 took 31 milliseconds
pass number 8 took 27 milliseconds
pass number 9 took 26 milliseconds
pass number 10 took 45 milliseconds
pass number 11 took 335 milliseconds
pass number 12 took 34 milliseconds
pass number 13 took 35 milliseconds
pass number 14 took 71 milliseconds
pass number 15 took 66 milliseconds
pass number 16 took 64 milliseconds
pass number 17 took 58 milliseconds
pass number 18 took 71 milliseconds
pass number 19 took 65 milliseconds
pass number 20 took 68 milliseconds
pass number 21 took 67 milliseconds
pass number 22 took 83 milliseconds
pass number 23 took 11986 milliseconds
pass number 24 took 7948 milliseconds
pass number 25 took 38 milliseconds
pass number 26 took 36 milliseconds
pass number 27 took 27 milliseconds
pass number 28 took 31 milliseconds
..SNIP lots between 30-40ms...
pass number 44 took 34 milliseconds
pass number 45 took 34 milliseconds
pass number 46 took 33 milliseconds
pass number 47 took 2630 milliseconds
pass number 48 took 12255 milliseconds
pass number 49 took 33 milliseconds
...SNIP a load of lines which are all between 30 to 50ms...
pass number 93 took 39 milliseconds
pass number 94 took 43 milliseconds
pass number 95 took 7056 milliseconds
pass number 96 took 33323 milliseconds
pass number 97 took 228 milliseconds
pass number 98 took 70 milliseconds
pass number 99 took 84 milliseconds

You can clearly see the points where it has to re-allocate. I'm guessing it simply doubles the size of the backing storage and rehashes the current items, as there's a long stretch at the end where it doesn't do this. Some of these resizes are very expensive, though (30+ seconds! ouch).
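If you want to pinpoint exactly which insert triggers a resize, you can time each individual Add and log the outliers. A rough sketch (the 100 ms threshold is an arbitrary value I picked):

using System;
using System.Collections.Generic;
using System.Diagnostics;

var dict = new Dictionary<int, byte[]>();
var sw = new Stopwatch();
for (int i = 0; i < 100000000; i++)
{
    sw.Restart();
    dict.Add(i, null);
    sw.Stop();
    // A single Add only takes this long when the dictionary has to grow
    // its backing storage and rehash every existing entry.
    if (sw.ElapsedMilliseconds > 100)
        Console.WriteLine($"Add #{i} took {sw.ElapsedMilliseconds} ms (resize?)");
}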

and here is the output if I do pre-allocate the dictionary size:

initially allocating arrays took 15494 ms
GC after array allocation took 2622 ms
pass number 0 took 9585 milliseconds
pass number 1 took 107 milliseconds
pass number 2 took 91 milliseconds
pass number 3 took 145 milliseconds
pass number 4 took 83 milliseconds
pass number 5 took 118 milliseconds
pass number 6 took 133 milliseconds
pass number 7 took 126 milliseconds
pass number 8 took 65 milliseconds
pass number 9 took 52 milliseconds
pass number 10 took 42 milliseconds
pass number 11 took 34 milliseconds
pass number 12 took 45 milliseconds
pass number 13 took 48 milliseconds
pass number 14 took 46 milliseconds
..SNIP lots between 30-80ms...
pass number 45 took 80 milliseconds
pass number 46 took 65 milliseconds
pass number 47 took 64 milliseconds
pass number 48 took 65 milliseconds
pass number 49 took 122 milliseconds
pass number 50 took 103 milliseconds
pass number 51 took 45 milliseconds
pass number 52 took 77 milliseconds
pass number 53 took 64 milliseconds
pass number 54 took 96 milliseconds
..SNIP lots between 30-80ms...
pass number 77 took 44 milliseconds
pass number 78 took 85 milliseconds
pass number 79 took 142 milliseconds
pass number 80 took 138 milliseconds
pass number 81 took 47 milliseconds
pass number 82 took 44 milliseconds
..SNIP lots between 30-80ms...
pass number 93 took 52 milliseconds
pass number 94 took 50 milliseconds
pass number 95 took 63 milliseconds
pass number 96 took 111 milliseconds
pass number 97 took 175 milliseconds
pass number 98 took 96 milliseconds
pass number 99 took 67 milliseconds

Memory usage goes up to just over 9GB whilst initially creating the arrays, drops to about 6.5GB after the GC.Collect, climbs back up to over 9GB whilst adding to the dictionary, and then, after all is done (and it's sat waiting on a Console.ReadLine()), it drops back down to ~3.7GB and stays there.

It's clearly far faster in operation to pre-allocate the dictionary, though.
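Worth noting: on .NET Core 2.0 and later, Dictionary<TKey,TValue> also exposes EnsureCapacity, so you don't have to commit to the final size at construction; you can pay the resize cost once, at a moment you choose, before a known burst of inserts:

using System.Collections.Generic;

var dict = new Dictionary<int, byte[]>();
// ... dictionary is already live and partially filled ...
// One resize/rehash happens here, at a time of your choosing; none will
// happen during the inserts that follow, up to the requested 100M entries.
dict.EnsureCapacity(100000000);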

For reference, the original answer is below.

I just wrote this little test. I have no idea WHAT you're storing, so I've just created a little pointless class with not much info and used an int as the key, but my takeaways from this are:

  1. It doesn't appear to get progressively slower at adding to the dictionary until it gets to around 40 million items. Running a 'Release' build targeted for x64, it took around 500ms for every million inserts, and then passes 41 through 46 took more like 700-850ms (a noticeable jump at that point).

  2. It got to somewhere over 46,000,000 items, ate about 4GB of RAM and fell over with an OutOfMemoryException.

  3. Use a database, or the Dictionary abuse team will come and take you down.

code:

class AThing
{
    public string Name { get; set; }
    public int id { get; set; }
}
class Program
{
    static void Main(string[] args)
    {
        Dictionary<int, AThing> dict = new Dictionary<int, AThing>();

        for (var c = 0; c < 100; c++)
        {
            DateTime nowTime = DateTime.Now;
            for (var i = 0; i < 1000000; i++)
            {
                var thing = new AThing { id = (c * 1000000) + i, Name = $"Item {(c * 1000000) + i}" };
                dict.Add(thing.id, thing);
            }
            var timeTaken = DateTime.Now - nowTime;
            // TotalMilliseconds reports the full elapsed time; the Milliseconds
            // property would only give the 0-999 ms component of it.
            Console.WriteLine($"pass number {c} took {timeTaken.TotalMilliseconds} milliseconds");
        }

    }
}
answered Mar 15 '23 by GPW