
What causes this strange drop in performance with a *medium* number of items?

I have just read an article by Rico Mariani that concerns the performance of memory access under different locality, architecture, alignment, and density conditions.

The author built arrays of varying sizes, each containing a doubly linked list with an int payload, and shuffled each list to a certain percentage. He experimented with these lists and found some consistent results on his machine.
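
To make the setup concrete, here is a minimal C++ sketch of what such a benchmark might look like. The Node layout and the build_list/shuffle_links helpers are my own illustration, not the article's actual code; only the 12-byte node size and the rand()-based partial shuffle are taken from the article and the answer below.

    #include <cstddef>   // std::size_t
    #include <cstdlib>   // std::rand
    #include <utility>   // std::swap
    #include <vector>

    // Hypothetical node layout: on 32-bit x86 (sizeof(int*) == 4) two pointers
    // plus an int payload give sizeof(Node) == 12, matching the quoted sizes.
    struct Node {
        Node* next;
        Node* prev;
        int   payload;
    };

    // Allocate `count` nodes contiguously and link them in array order, so the
    // unshuffled list is traversed in perfectly sequential memory order.
    std::vector<Node> build_list(std::size_t count) {
        std::vector<Node> nodes(count);
        for (std::size_t i = 0; i < count; ++i) {
            nodes[i].next    = (i + 1 < count) ? &nodes[i + 1] : nullptr;
            nodes[i].prev    = (i > 0)         ? &nodes[i - 1] : nullptr;
            nodes[i].payload = static_cast<int>(i);
        }
        return nodes;
    }

    // "Shuffled to X percent": swap a corresponding fraction of randomly chosen
    // positions in the traversal order, then relink the nodes to follow it.
    void shuffle_links(std::vector<Node>& nodes, int percent) {
        const std::size_t count = nodes.size();
        std::vector<std::size_t> order(count);
        for (std::size_t i = 0; i < count; ++i) order[i] = i;

        const std::size_t swaps = count * static_cast<std::size_t>(percent) / 100;
        for (std::size_t s = 0; s < swaps; ++s) {
            // rand() appears here only because the original code apparently used
            // it; its limited range becomes important in the answer below.
            std::size_t a = static_cast<std::size_t>(std::rand()) % count;
            std::size_t b = static_cast<std::size_t>(std::rand()) % count;
            std::swap(order[a], order[b]);
        }

        for (std::size_t i = 0; i < count; ++i) {
            Node& n = nodes[order[i]];
            n.next = (i + 1 < count) ? &nodes[order[i + 1]] : nullptr;
            n.prev = (i > 0)         ? &nodes[order[i - 1]] : nullptr;
        }
    }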

Quoting one of the result tables (values are average nanoseconds per item; rows are item counts, columns are shuffle percentages):

Pointer implementation with no changes
sizeof(int*)=4   sizeof(T)=12
  shuffle   0%  1%  10% 25% 50% 100%
    1000    1.99    1.99    1.99    1.99    1.99    1.99
    2000    1.99    1.85    1.99    1.99    1.99    1.99
    4000    1.99    2.28    2.77    2.92    3.06    3.34
    8000    1.96    2.03    2.49    3.27    4.05    4.59
   16000    1.97    2.04    2.67    3.57    4.57    5.16
   32000    1.97    2.18    3.74    5.93    8.76    10.64
   64000    1.99    2.24    3.99    5.99    6.78    7.35
  128000    2.01    2.13    3.64    4.44    4.72    4.80
  256000    1.98    2.27    3.14    3.35    3.30    3.31
  512000    2.06    2.21    2.93    2.74    2.90    2.99
 1024000    2.27    3.02    2.92    2.97    2.95    3.02
 2048000    2.45    2.91    3.00    3.10    3.09    3.10
 4096000    2.56    2.84    2.83    2.83    2.84    2.85
 8192000    2.54    2.68    2.69    2.69    2.69    2.68
16384000    2.55    2.62    2.63    2.61    2.62    2.62
32768000    2.54    2.58    2.58    2.58    2.59    2.60
65536000    2.55    2.56    2.58    2.57    2.56    2.56

The author explains:

This is the baseline measurement. You can see the structure is a nice round 12 bytes and it will align well on x86. Looking at the first column, with no shuffling, as expected things get worse and worse as the array gets bigger until finally the cache isn't helping much and you have about the worst you're going to get, which is about 2.55ns on average per item.
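
(For context on how an "average per item" number like 2.55 ns arises: a timing loop along the following lines walks the list once and divides the elapsed time by the item count. This is my own sketch, reusing the hypothetical Node type from the sketch above, not the author's measurement harness.)

    #include <chrono>
    #include <cstddef>

    // Walk the list once, summing payloads so the loads cannot be optimized away,
    // and return the average time per node in nanoseconds.
    double time_per_item_ns(const Node* head, std::size_t count) {
        using clock = std::chrono::steady_clock;
        long long sum = 0;

        const auto start = clock::now();
        for (const Node* p = head; p != nullptr; p = p->next)
            sum += p->payload;
        const auto stop = clock::now();

        volatile long long sink = sum;   // keep the traversal observable
        (void)sink;

        const auto ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        return static_cast<double>(ns) / static_cast<double>(count);
    }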

But something quite strange can be seen around 32k items:

The results for shuffling are not exactly what I expected. At small sizes, it makes no difference. I expected this because basically the entire table is staying hot in the cache and so locality isn't mattering. Then as the table grows you see that shuffling has a big impact at about 32000 elements. That's 384k of data. Likely because we've blown past a 256k limit.

Now the bizarre thing is this: after this the cost of shuffling actually goes down, to the point that later on it hardly matters at all. Now I can understand that at some point shuffled or not shuffled really should make no difference because the array is so huge that runtime is largely gated by memory bandwidth regardless of order. However... there are points in the middle where the cost of non-locality is actually much worse than it will be at the endgame.

What I expected to see was that shuffling caused us to reach maximum badness sooner and stay there. What actually happens is that at middle sizes non-locality seems to cause things to go very very bad... And I do not know why :)

So the question is: What might have caused this unexpected behavior?

I have thought about this for some time but found no good explanation. The test code looks fine to me. I don't think CPU branch prediction is the culprit here, as its effect should be observable far earlier than 32k items and should produce a much smaller spike.

I have confirmed this behavior on my box; it looks pretty much exactly the same.

I figured it might be caused by forwarding of CPU state, so I changed the order of row and/or column generation, with almost no difference in the output. To make sure, I generated data for a larger continuous sample. For ease of viewing, I put it into Excel:

[Excel chart of the measurements over the larger continuous sample]

Another independent run, for good measure, showed a negligible difference.

asked Sep 29 '14 by mafu


1 Answer

I put my best theory here: http://blogs.msdn.com/b/ricom/archive/2014/09/28/performance-quiz-14-memory-locality-alignment-and-density-suggestions.aspx#10561107 but it's just a guess; I haven't confirmed it.


Mystery solved! From my blog:

Ryan (Mon, Sep 29 2014 9:35 AM):

Wait - are you concluding that completely randomized access is the same speed as sequential for very large cases? That would be very surprising!!

What's the range of rand()? If it's 32k that would mean you're just shuffling the first 32k items and doing basically sequential reads for most items in the large case, and the per-item avg would become very close to the sequential case. This matches your data very well.


Rico Mariani (Mon, Sep 29 2014 10:57 AM):

That's exactly it!

The rand function returns a pseudorandom integer in the range 0 to RAND_MAX (32767). Use the srand function to seed the pseudorandom-number generator before calling rand.

I need a different random number generator!

I'll redo it!
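
To spell out the pitfall and the kind of fix being alluded to (a sketch with my own naming, not the actual redone code): on a C runtime where RAND_MAX is 32767, an index computed from rand() can never reach past the first 32768 slots, while a generator from <random> covers the whole range.

    #include <cstddef>
    #include <cstdlib>   // std::rand, RAND_MAX
    #include <random>    // std::mt19937, std::uniform_int_distribution

    // Pitfall: with RAND_MAX == 32767, rand() % count can only yield indices
    // 0..32767 no matter how large count is, so "shuffling" a multi-million
    // element array only permutes a 32k prefix and leaves the rest sequential.
    std::size_t buggy_index(std::size_t count) {
        return static_cast<std::size_t>(std::rand()) % count;
    }

    // Fix: a generator/distribution pair from <random> produces indices over
    // the full range, regardless of the array size.
    std::size_t full_range_index(std::mt19937& gen, std::size_t count) {
        std::uniform_int_distribution<std::size_t> dist(0, count - 1);
        return dist(gen);
    }

Alternatively, std::shuffle with a std::mt19937 builds a full permutation in one call and avoids the modulo bias as well.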

answered Oct 31 '22 by Rico Mariani