Does C# store arrays larger than 512 longs (4096 bytes) differently?

I did some benchmarks with collection types implemented in the .NET Framework.

From the Reference Source I know that List<T> uses an array to store contents. To avoid resizing the array with every insertion, the array length gets doubled every time the free space runs out.
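For illustration, here is a minimal snippet (not from the benchmark itself) that makes the doubling visible through List<T>.Capacity:

var list = new List<long>();
int lastCapacity = list.Capacity;
for (int i = 0; i < 600; i++)
{
    list.Add(i);
    if (list.Capacity != lastCapacity)
    {
        // On the .NET Framework, capacity jumps 0 -> 4 -> 8 -> 16 -> ...
        // each time the internal array runs out of space
        Console.WriteLine($"Count = {list.Count}: capacity {lastCapacity} -> {list.Capacity}");
        lastCapacity = list.Capacity;
    }
}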

[Figure: size vs. time diagram for List<T>.Add]

Now my benchmark inserts random long values into a List<long> (see the figure above for the size vs. time graph). There are obvious "lag spikes" at list sizes like 128 or 256, where the internal array has to be reallocated. But at a size of 512 (and apparently also at 128?), there seems to be a really big lag, and the time it takes to insert one item stays permanently increased.

In my understanding, the graph should be strictly constant except on the occasions where the internal array needs to be reallocated. Is there any reason for this behavior, perhaps related to the CLR or to Windows memory management / memory fragmentation?

The benchmarks were executed as a 64-bit application on a Windows 10 / i7-3630QM machine (source code below). Because one single add operation wouldn't be measurable, I create 1000 lists and add one item to each of them for every list size.

for (int i = 1; i <= MaxCollectionSize; i++)
{
    // Reset time measurement
    TestContainer.ResetSnapshot();

    // Enable time measurement
    TestContainer.BeginSnapshot();
    // Execute one add operation on 1000 lists each
    ProfileAction.Invoke(TestContainer);
    TestContainer.EndSnapShot();

    // Despite its name, GetElapsedMilliSeconds appears to return raw Stopwatch
    // ticks - hence the conversion via Stopwatch.Frequency below.
    double elapsedMilliseconds = (TestContainer.GetElapsedMilliSeconds() / (double)Stopwatch.Frequency) * 1000;
    // ...
}

EDIT: I double-checked my results and yes, they are reproducible. I increased the number of collections tested from 1000 to 10000, and the resulting graph is now much smoother (see image below). The spikes from resizing the internal array are now clearly visible. And yet the steps in the graph remain - this diverges from the amortized O(1) complexity an array insert should have if you ignore the resizing.

I also tried to trigger a GC collection before each Add operation and the graph remained exactly the same.

As for the concerns about the creation of delegate objects: All my delegates (like ProfileAction) are instance properties that remain assigned during one complete test cycle, in this case 10000 lists with 1000 add operations each.

[Figure: size vs. time graph, measured over 10000 lists]

asked Jun 30 '16 by Franz Wimmer



2 Answers

Does C# store arrays larger than 512 longs (4096 bytes) differently?

No. It does when the total size is 85,000 bytes or more: such allocations go on the Large Object Heap (which is neither compacted nor generational).
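If you want to verify the threshold, a small sketch (the array lengths here are just illustrative): the runtime reports LOH objects as generation 2 immediately after allocation.

var small = new long[10000];   // 80,000 bytes of data -> small object heap
var large = new long[11000];   // 88,000 bytes -> over the 85,000 byte threshold, LOH
Console.WriteLine(GC.GetGeneration(small));  // 0 - freshly allocated gen 0 object
Console.WriteLine(GC.GetGeneration(large));  // 2 - the LOH is collected with gen 2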

However:

create 1000 lists and add one item each for every list size.

That gives times of around ~5 ms for each test. The Windows scheduler delta is greater than that (actual values range from 40 ms to 100 ms depending on edition and version). Could you be seeing the scheduler performing a thread switch?

I suggest running each size for at least 250 ms to even out these kinds of effects.
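A sketch of what that could look like, reusing ProfileAction and TestContainer from the question's harness (note that this performs more than one add per size step, so the size axis would need to account for the extra iterations):

var sw = Stopwatch.StartNew();
int iterations = 0;
do
{
    // One add operation on each of the 1000 lists, as in the original loop
    ProfileAction.Invoke(TestContainer);
    iterations++;
} while (sw.ElapsedMilliseconds < 250);   // accumulate at least 250 ms of work
double perIterationMs = sw.Elapsed.TotalMilliseconds / iterations;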

EDIT: Also, as Lasse's comment on the question notes, this could be the GC. To eliminate it from your timings, force a GC at the start of the size loop, but before starting the clock. Also monitor the GC performance counters.
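The usual idiom for a full, blocking collection before the timed region looks like this (a sketch):

// Force a full blocking GC so collections don't land inside the timed region
GC.Collect();
GC.WaitForPendingFinalizers();  // let finalizers run ...
GC.Collect();                   // ... and collect whatever they released
// ... now start the stopwatch and run the add operations ...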

answered Sep 17 '22 by Richard


Okay, let's look at the simple parts of the picture first. The spikes are caused by reallocation, copying and garbage collection - no big surprise there. The anomalously low times for the first few additions to a list are easily explained by cache locality: while the heap still fits into the CPU cache whole, memory access can be random while still having very low latency. Once the heap gets big enough, and the array length value (as well as the list count value) sits far enough from the item being inserted, cache locality becomes a noticeable effect - in my testing on my machine, in 32-bit x86 code, optimizing for cache locality improves the performance of the whole test fourfold.

However, while those effects nicely explain both the spikes themselves and the fact that operations after each spike take more time than before it, they don't really explain the trend that follows - there's no obvious reason why inserting the 600th element should take longer than inserting the 550th (assuming the last resize was at 512 or so). Profiling shows that constant costs are quite high, but doesn't show anything noticeably increasing over time.

My test code is cut down to the very basics:

var collections = new List<int>[100000];

for (var i = 0; i < collections.Length; i++)
{
  collections[i] = new List<int>();       
}

for (var i = 0; i < 1024; i++)
{
  for (var j = 0; j < collections.Length; j++)
  {
    collections[j].Add(i);
  }
}

Even though the only abstraction that remains is the Add call itself, the trend is still visible in the test data, though I have to note that my curve is nowhere near as smooth as yours, and the deviations are huge. A typical cycle takes around 20 ms, while the spikes go as high as 5 s.

Okay, time to look at the disassembly. My test code is very simple (just the inner loop body):

002D0532  mov         eax,dword ptr [ebp-18h]  
002D0535  mov         ecx,dword ptr [eax+esi*4+8]  
002D0539  mov         edx,ebx  
002D053B  cmp         dword ptr [ecx],ecx  
002D053D  call        7311D5F0  

The collections reference is stored on the stack. Both i and j are in registers, as expected - in fact, j is in esi, which is very handy. So first we take the reference to collections, add j * 4 + 8 to get at the actual list reference, and store that in ecx (the this pointer for the method we're about to call). i is stored in ebx, but must be moved to edx to call Add - no big deal transferring a value between two general-purpose registers, though :) Then there's the simple optimistic null check, and finally the call itself.

The first thing to note is that there's no branching involved, so no branch mispredictions. Second, we have two memory accesses: the first is on the stack, which is pretty much guaranteed to always be in the cache. The second is worse - that's where we get our cache locality issues. However, the lag from this depends entirely on the length (and number) of the arrays, so it should (and does) correlate with the array resizes.

Time to look at the Add method itself :) Remember, ecx contains the list instance, while edx has the item we're adding.

First, there's the usual method prologue, nothing special. Next, we check the array size:

8bf1    mov esi, ecx
8bfa    mov edi, edx
8b460c  mov eax, DWORD PTR [esi+0xc]    ; Get the list size
8b5604  mov edx, DWORD PTR [esi+0x4]    ; Get the array reference
3bf204  cmp eax, DWORD PTR [edx+0x4]    ; size == array.Length?
741c    je HandleResize ; Not important for us

We have three more memory accesses here. The first two are essentially identical, since the values being loaded are located close together. The array will only be colocated with the list object before the first array resize, which further improves cache performance on the first few inserts. Note that there's not a lot the CPU can do in parallel here, but the three memory accesses should still only pay the latency cost once. The branch will almost always be predicted correctly - it's only taken once we reach the array size, after which we do the same branch once for each list.
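For reference, this is the C# those instructions were compiled from - List<T>.Add as published in the Reference Source (abbreviated):

public void Add(T item)
{
    if (_size == _items.Length) EnsureCapacity(_size + 1);
    _items[_size++] = item;
    _version++;
}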

Two pieces remain: adding the item itself, and updating the internal version of the list (to invalidate any ongoing enumerations of the list):

_items[_size++] = item;
_version++;

It's a bit wordier in assembly :)

8b5604  mov edx, DWORD PTR [esi+0x4]    ; get the array reference again
8b4e0c  mov ecx, DWORD PTR [esi+0xc]    ; ... and the list size
8d4101  lea eax, [ecx+0x1]  ; Funny, but the best way to get size + 1 :)
89460c  mov DWORD PTR [esi+0xc], eax    ; ... and store the new size back in the list object
3b4a04  cmp ecx, DWORD PTR [edx+0x4]    ; Array length check
7318    jae ThrowOutOfRangeException    ; If array is shorter than size, throw
897c8a08    mov DWORD PTR [edx+ecx*4+0x8], edi  ; Store item in the array
ff4610  inc DWORD PTR [esi+0x10]    ; Increase the version
; ... and the epilogue, not important

That's it. We have a branch that will never be taken (assuming single-threaded use; we already checked the array size earlier). We have quite a few accesses: four relative to the list itself (including two updates), and two more on the array (including one update). Now, while there's no reason for a cache miss on the list (it's almost always already loaded), there are invalidations due to the updates. In contrast, the array access will almost always incur a cache miss in our scenario, the only exception being before the first array resize. In fact, you can see that at first there's no cache miss (array and object colocated, small), then one miss (still colocated, but the item is beyond the cache line), then two (both the length and the item access are beyond the cache line).

This is certainly interesting (and could benefit a tiny bit from a manual optimization :P), but it again only gives us the "stairs" on the profiling data. Importantly, there's no allocations involved, so no GC.

With all this in hand, I'd conclude that List<T>.Add is indeed O(1) when no array resize is needed. For very small arrays (and arrays colocated with their reference), there are a few extra optimizations that make things faster, but that's not important here.

So the trend you see in your profiling data must be either environmental, or directly related to the profiling itself, or just a badly chosen averaging method. For example, if I run this on 100 000 lists:

  1. Add the first 550 items
  2. Add another 100 items
  3. And another 100 items

There is variation between the times steps 2 and 3 take, but no trend - it's just as likely for 2 to be faster as for 3 (on the order of a ~2 ms difference on timespans of ~400 ms, so about 0.5% deviation). And yet, if I do the "warmup" with 2100 items instead, the subsequent steps take almost half as long as before. Changing the number of lists doesn't have a noticeable per-collection effect (as long as everything fits into physical memory, of course :)).

Okay, this is very visible even with just a simple Stopwatch running outside of the debugger in release mode, and with simple sampling of the result data. So we can rule out both profiling effects and statistical errors.

But what could be the environmental cause?

  • GC doesn't get involved at all, outside of the array resizes. There are no allocations, and the profiler is very clear about the fact that no GC happened between resizes either (though this is of limited value with a concurrent GC :)). Tweaking the GC settings makes everything much slower, but again, it only impacts the resize spikes and their close surroundings. Most importantly, the number of lists (and thus the heap size) doesn't have any effect on the trend, which would be quite surprising if the GC were the cause.
  • The heap is fragmented, but very orderly. This makes the relocations have less overhead under memory pressure, but again, only affects the array resizes. In any case, this is nothing surprising and in fact is well documented.

So, looking at all this... I have no idea why the trend exists. However, do note that the trend definitely isn't linear either - the increase falls off quickly as you increase list size. From about 15k items on, the trend disappears completely, so Add is indeed O(1) excluding array resizes - it just has some weird behaviour at some sizes :)

... unless you pre-allocate the lists. In that case, the results are 100% consistent with my predictions based just on cache locality, which seems to suggest that the resizing and GC patterns have a huge effect on the efficiency of the usual caching algorithms (at least on my CPU - this will vary quite a bit, I reckon). Remember when we talked about the cache misses incurred during the whole Add operation? There's a trick to it: if we can keep enough cache lines alive between the two loops, a cache miss will be averted very often. If we assume a 64-byte cache line and an optimal cache eviction algorithm, you'll get no misses on the List member accesses and the array length accesses, and just one miss per array every 16 adds (a cache line holds 16 4-byte ints). We don't need the rest of the array at all! There are a few other cache lines you need (for example, the list instances), but the arrays are by far the biggest deal.

Now, let's do the math. A hundred thousand collections at 2 × 64 B of hot cache lines each adds up, in the worst case, to about 12 MiB - and my CPU has 10 MiB of cache. I can almost fit all the relevant array data in cache! Now, of course, I'm not the only application (and thread) using that cache, so we can expect the flip-over point to be somewhat below the ideal - let's see how changing the number of collections changes our results.
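My test loop for the tables below looks roughly like this (a sketch reconstructed from the description; listCount varies per row and the step sizes match the columns):

int listCount = 6400;                      // varies per table row
var collections = new List<int>[listCount];
for (var i = 0; i < listCount; i++)
{
    collections[i] = new List<int>(8000);  // pre-allocated capacity: no resizes, ever
}

// One timed step: add `items` elements to every list (columns A, B and C)
void AddStep(int items)
{
    for (var n = 0; n < items; n++)
    {
        for (var j = 0; j < listCount; j++)
        {
            collections[j].Add(n);
        }
    }
}

AddStep(2000);  // column A
AddStep(100);   // column B
AddStep(100);   // column C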

Lists pre-allocated to 8000 items (32 kB each), adding 2000 items, then 100, then another 100. Column A is the time for the first 2000 adds; B and C are the times for the two 100-add steps (all in ms):

Lists   A       B    C
400     18      1    1
800     52      2    2
1600    120     6    6
3200    250     12   12
6400    506     25   25
12800   1046    52   53
25600   5821    270  270

Ha! Quite nicely visible. The timings increase linearly with the list count until the last row - that's when our cache ran out. That works out to somewhere around 3-8 MiB of total cache usage - most likely the result of me neglecting something important that also needs caching, or some optimization on the part of the OS or CPU to prevent me from hogging the whole cache, or something :)

The slight non-linearity at the very small list counts is most likely related to the slow overflow of the lower-level caches - 400 lists fit comfortably in my L2 cache (at ~128 B of hot lines per list, that's only ~50 kB), 800 already overflow a bit, 1600 quite a bit more, and by the time we reach 3200, the L2 cache can be neglected almost entirely.

And for our final check, the same scenario, but adding 4000 items instead of 2000:

Lists   A       B   C
400     42      1   1
800     110     3   2
1600    253     6   6
3200    502     12  12
6400    1011    25  25
12800   2091    52  53
25600   10395   250 250

As you can see, the item count has no effect on the per-item insert time whatsoever - the whole trend has simply disappeared.

So, there you have it. The trend is caused by the GC indirectly (through sub-optimal allocation in your code and compaction patterns in the GC that disrupt cache locality) and cache-overflow directly. With smaller item counts, it's simply more likely that any given piece of required memory will be in cache right now. When arrays need to be resized, most of the cached memory is pretty much worthless, and will be slowly invalidated and replaced with more useful memory - but the whole memory usage pattern is something very far from what the CPU is optimized for. In contrast, by keeping the arrays preallocated, we ensure that once we have the list in memory, we also see the array length (bonus 1), and the cache lines already pointing to the end of the array will be useful for a few loops (bonus 2). Since there's no array resizing, none of these objects need to move in memory at all, and there's nice colocation.

answered Sep 18 '22 by Luaan