
Cache miss stress test: stunning results... any explanations?

To get a feel for how a modern computer actually performs when data is 'spread out' in memory (i.e. how costly cache misses are), I wrote a simple test: I allocate a large block of RAM (the code below allocates about 1000 MB), then perform a constant number of reads, repeating the test with increasing byte offsets between reads. When a read reaches the end of the buffer, I wrap around to the beginning.

I'm quite surprised by the results. There seems to be a cost barrier around 32 bytes, and another one around 32 KB. I guess this has something to do with L1/L2/L3 cache loads, or with the virtual memory page size? What stuns me the most is that only about 16 completely distinct memory locations seem to stay cached at once. That's very low!!! Any explanations (OS, hardware)?

Here are the results on three different computers, from the fastest to the cheapest, followed by my test code (which uses only standard libraries).

Fast HP workstation, 16 GB RAM (test run on 32-bit Windows):

time=0.364000 byteIncrement=4 numReadLocations=262144000 numReads=262144000
time=0.231000 byteIncrement=8 numReadLocations=131072000 numReads=262144000
time=0.339000 byteIncrement=16 numReadLocations=65536000 numReads=262144000
time=0.567000 byteIncrement=32 numReadLocations=32768000 numReads=262144000
time=1.177000 byteIncrement=64 numReadLocations=16384000 numReads=262144000
time=1.806000 byteIncrement=128 numReadLocations=8192000 numReads=262144000
time=2.293000 byteIncrement=256 numReadLocations=4096000 numReads=262144000
time=2.464000 byteIncrement=512 numReadLocations=2048000 numReads=262144000
time=2.621000 byteIncrement=1024 numReadLocations=1024000 numReads=262144000
time=2.775000 byteIncrement=2048 numReadLocations=512000 numReads=262144000
time=2.908000 byteIncrement=4096 numReadLocations=256000 numReads=262144000
time=3.007000 byteIncrement=8192 numReadLocations=128000 numReads=262144000
time=3.183000 byteIncrement=16384 numReadLocations=64000 numReads=262144000
time=3.758000 byteIncrement=32768 numReadLocations=32000 numReads=262144000
time=4.287000 byteIncrement=65536 numReadLocations=16000 numReads=262144000
time=6.366000 byteIncrement=131072 numReadLocations=8000 numReads=262144000
time=6.124000 byteIncrement=262144 numReadLocations=4000 numReads=262144000
time=5.295000 byteIncrement=524288 numReadLocations=2000 numReads=262144000
time=5.540000 byteIncrement=1048576 numReadLocations=1000 numReads=262144000
time=5.844000 byteIncrement=2097152 numReadLocations=500 numReads=262144000
time=5.785000 byteIncrement=4194304 numReadLocations=250 numReads=262144000
time=5.714000 byteIncrement=8388608 numReadLocations=125 numReads=262144000
time=5.825000 byteIncrement=16777216 numReadLocations=62 numReads=262144000
time=5.759000 byteIncrement=33554432 numReadLocations=31 numReads=262144000
time=2.222000 byteIncrement=67108864 numReadLocations=15 numReads=262144000
time=0.471000 byteIncrement=134217728 numReadLocations=7 numReads=262144000
time=0.377000 byteIncrement=268435456 numReadLocations=3 numReads=262144000
time=0.166000 byteIncrement=536870912 numReadLocations=1 numReads=262144000

MacBook Pro 2010, 4 GB RAM (test run on 32-bit Windows):

time=0.476000 byteIncrement=4 numReadLocations=262144000 numReads=262144000
time=0.357000 byteIncrement=8 numReadLocations=131072000 numReads=262144000
time=0.634000 byteIncrement=16 numReadLocations=65536000 numReads=262144000
time=1.173000 byteIncrement=32 numReadLocations=32768000 numReads=262144000
time=2.360000 byteIncrement=64 numReadLocations=16384000 numReads=262144000
time=3.469000 byteIncrement=128 numReadLocations=8192000 numReads=262144000
time=3.990000 byteIncrement=256 numReadLocations=4096000 numReads=262144000
time=3.549000 byteIncrement=512 numReadLocations=2048000 numReads=262144000
time=3.758000 byteIncrement=1024 numReadLocations=1024000 numReads=262144000
time=3.867000 byteIncrement=2048 numReadLocations=512000 numReads=262144000
time=4.275000 byteIncrement=4096 numReadLocations=256000 numReads=262144000
time=4.310000 byteIncrement=8192 numReadLocations=128000 numReads=262144000
time=4.584000 byteIncrement=16384 numReadLocations=64000 numReads=262144000
time=5.144000 byteIncrement=32768 numReadLocations=32000 numReads=262144000
time=6.100000 byteIncrement=65536 numReadLocations=16000 numReads=262144000
time=8.111000 byteIncrement=131072 numReadLocations=8000 numReads=262144000
time=6.256000 byteIncrement=262144 numReadLocations=4000 numReads=262144000
time=6.311000 byteIncrement=524288 numReadLocations=2000 numReads=262144000
time=6.416000 byteIncrement=1048576 numReadLocations=1000 numReads=262144000
time=6.635000 byteIncrement=2097152 numReadLocations=500 numReads=262144000
time=6.530000 byteIncrement=4194304 numReadLocations=250 numReads=262144000
time=6.544000 byteIncrement=8388608 numReadLocations=125 numReads=262144000
time=6.545000 byteIncrement=16777216 numReadLocations=62 numReads=262144000
time=5.272000 byteIncrement=33554432 numReadLocations=31 numReads=262144000
time=1.524000 byteIncrement=67108864 numReadLocations=15 numReads=262144000
time=0.538000 byteIncrement=134217728 numReadLocations=7 numReads=262144000
time=0.508000 byteIncrement=268435456 numReadLocations=3 numReads=262144000
time=0.817000 byteIncrement=536870912 numReadLocations=1 numReads=262144000

Cheap Acer "family computer", 4 GB RAM:

time=0.732000 byteIncrement=4 numReadLocations=262144000 numReads=262144000
time=0.549000 byteIncrement=8 numReadLocations=131072000 numReads=262144000
time=0.765000 byteIncrement=16 numReadLocations=65536000 numReads=262144000
time=1.196000 byteIncrement=32 numReadLocations=32768000 numReads=262144000
time=2.318000 byteIncrement=64 numReadLocations=16384000 numReads=262144000
time=2.483000 byteIncrement=128 numReadLocations=8192000 numReads=262144000
time=2.760000 byteIncrement=256 numReadLocations=4096000 numReads=262144000
time=3.194000 byteIncrement=512 numReadLocations=2048000 numReads=262144000
time=3.369000 byteIncrement=1024 numReadLocations=1024000 numReads=262144000
time=3.720000 byteIncrement=2048 numReadLocations=512000 numReads=262144000
time=4.842000 byteIncrement=4096 numReadLocations=256000 numReads=262144000
time=5.797000 byteIncrement=8192 numReadLocations=128000 numReads=262144000
time=9.865000 byteIncrement=16384 numReadLocations=64000 numReads=262144000
time=19.273000 byteIncrement=32768 numReadLocations=32000 numReads=262144000
time=19.282000 byteIncrement=65536 numReadLocations=16000 numReads=262144000
time=19.606000 byteIncrement=131072 numReadLocations=8000 numReads=262144000
time=20.242000 byteIncrement=262144 numReadLocations=4000 numReads=262144000
time=20.956000 byteIncrement=524288 numReadLocations=2000 numReads=262144000
time=22.627000 byteIncrement=1048576 numReadLocations=1000 numReads=262144000
time=24.336000 byteIncrement=2097152 numReadLocations=500 numReads=262144000
time=24.403000 byteIncrement=4194304 numReadLocations=250 numReads=262144000
time=23.060000 byteIncrement=8388608 numReadLocations=125 numReads=262144000
time=20.553000 byteIncrement=16777216 numReadLocations=62 numReads=262144000
time=14.460000 byteIncrement=33554432 numReadLocations=31 numReads=262144000
time=1.752000 byteIncrement=67108864 numReadLocations=15 numReads=262144000
time=0.963000 byteIncrement=134217728 numReadLocations=7 numReads=262144000
time=0.687000 byteIncrement=268435456 numReadLocations=3 numReads=262144000
time=0.453000 byteIncrement=536870912 numReadLocations=1 numReads=262144000

Code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MEMBLOCSIZE ((2<<20)*500) //(2<<20) is 2 MiB, so this is ~1000 MB

//Reads ints starting at data, stepping incrementPerRead ints between reads
//and wrapping back to the start when the end of the block is passed.
int readMemory( int* data, int* dataEnd, int numReads, int incrementPerRead ) {
  int accum = 0;
  int* ptr = data;

  while(true) {
    accum += *ptr;
    if( numReads-- == 0)
      return accum;

    ptr += incrementPerRead;

    if( ptr >= dataEnd )
      ptr = data;
  }
}

int main()
{
  int* data = (int*)malloc(MEMBLOCSIZE);
  if( data == NULL ) //a ~1 GB allocation can fail in a 32-bit process
    return 1;
  int* dataEnd = data+(MEMBLOCSIZE / sizeof(int));

  int numReads = (MEMBLOCSIZE / sizeof(int));
  int dummyTotal = 0;
  int increment = 1;
  for( int loop = 0; loop < 28; ++loop ) {
    clock_t startTime = clock();

    dummyTotal += readMemory(data, dataEnd, numReads, increment);

    clock_t endTime = clock();
    double deltaTime = double(endTime-startTime)/double(CLOCKS_PER_SEC);

    printf("time=%f byteIncrement=%d numReadLocations=%d numReads=%d\n",
      deltaTime, increment*sizeof(int), MEMBLOCSIZE/(increment*sizeof(int)), numReads);

    increment *= 2;
  }
  //Use dummyTotal: make sure the optimizer is not removing my code...
  return dummyTotal == 666 ? 1: 0;
}

Based on some comments, I modified the test to use only 250 MB of RAM and to do 16 consecutive reads for each 'read', in case that activates prefetching. The results are still similar overall, but the last tests, the ones reading a few distant locations, now perform better (about 2 seconds instead of 5), so prefetching was probably not kicking in with the initial test.

#define MEMBLOCSIZE 262144000 //250 MB

int readMemory( int* data, int* dataEnd, int numReads, int incrementPerRead ) {
  int accum = 0;
  int* ptr = data;

  while(true) {
    accum += *ptr;
    if( numReads-- == 0)
      return accum;

    //16 more consecutive reads after the one above, in case sequential access activates the prefetcher
    for( int i = 1; i < 17; ++i )
      accum += *(ptr+i);

    ptr += incrementPerRead;

    if( ptr >= dataEnd - 17 ) //wrap early enough that the 16 extra reads above stay inside the block
      ptr = data;
  }
}

Results of this updated test on the MacBook Pro 2010:

time=0.691000 byteIncrement=4 numReadLocations=65536000 numReads=65536000
time=0.620000 byteIncrement=8 numReadLocations=32768000 numReads=65536000
time=0.715000 byteIncrement=16 numReadLocations=16384000 numReads=65536000
time=0.827000 byteIncrement=32 numReadLocations=8192000 numReads=65536000
time=0.917000 byteIncrement=64 numReadLocations=4096000 numReads=65536000
time=1.440000 byteIncrement=128 numReadLocations=2048000 numReads=65536000
time=2.646000 byteIncrement=256 numReadLocations=1024000 numReads=65536000
time=3.720000 byteIncrement=512 numReadLocations=512000 numReads=65536000
time=3.854000 byteIncrement=1024 numReadLocations=256000 numReads=65536000
time=4.673000 byteIncrement=2048 numReadLocations=128000 numReads=65536000
time=4.729000 byteIncrement=4096 numReadLocations=64000 numReads=65536000
time=4.784000 byteIncrement=8192 numReadLocations=32000 numReads=65536000
time=5.021000 byteIncrement=16384 numReadLocations=16000 numReads=65536000
time=5.022000 byteIncrement=32768 numReadLocations=8000 numReads=65536000
time=4.871000 byteIncrement=65536 numReadLocations=4000 numReads=65536000
time=5.163000 byteIncrement=131072 numReadLocations=2000 numReads=65536000
time=5.276000 byteIncrement=262144 numReadLocations=1000 numReads=65536000
time=4.699000 byteIncrement=524288 numReadLocations=500 numReads=65536000
time=1.997000 byteIncrement=1048576 numReadLocations=250 numReads=65536000
time=2.118000 byteIncrement=2097152 numReadLocations=125 numReads=65536000
time=2.071000 byteIncrement=4194304 numReadLocations=62 numReads=65536000
time=2.036000 byteIncrement=8388608 numReadLocations=31 numReads=65536000
time=1.923000 byteIncrement=16777216 numReadLocations=15 numReads=65536000
time=1.084000 byteIncrement=33554432 numReadLocations=7 numReads=65536000
time=0.607000 byteIncrement=67108864 numReadLocations=3 numReads=65536000
time=0.622000 byteIncrement=134217728 numReadLocations=1 numReads=65536000
asked Nov 02 '12 by Jerome CG


1 Answer

Note that most of what follows, like any conclusions you drew yourself, is speculative. Memory benchmarking is extremely complex, and a relatively naive benchmark like this one rarely tells you anything definite about the performance of a real program.

The primary "cost barrier", as you call it, at 32 kiB is probably really at 64 kiB (or a combination of both). Since you never initialize the memory, Windows pulls in zero pages as you read them. The allocation granularity is 64 kiB, and pages are always "readied" (and prefetched, if you memory-map) in chunks of that size, even if only one of the pages in the 64 kiB range is moved into your working set. This is something I found out while experimenting with memory mapping.
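
One way to take that first-touch cost out of the measurement (a minimal sketch, not part of the original benchmark) is to write to the whole buffer once before starting the clock, so the zero-page faults are paid up front:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MEMBLOCSIZE ((2<<20)*500)

int main()
{
  int* data = (int*)malloc(MEMBLOCSIZE);
  if( data == NULL )
    return 1;

  //Touch every page once: Windows commits (and zeroes) the pages here,
  //so the soft page faults happen before any timing starts.
  memset(data, 1, MEMBLOCSIZE);

  //... run the timed readMemory() loops exactly as in the question ...

  free(data);
  return 0;
}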

Your process working set, as set by Windows, is ridiculously small by default, so as you iterate over that memory block you constantly take page faults. Some are cheap, only flipping a flag in the page descriptor; others (at 64 kiB boundaries) are more expensive, pulling 16 new pages from the zero pool (or, in the worst case when the pool is empty, zeroing pages on the spot). This may very well explain one of the "cost barriers" you see.
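
If you want to rule working-set trimming out entirely, a possible variation (a Win32 sketch, assuming the requested limits are granted; not part of the original code) is to enlarge the working set and lock the buffer before timing:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

#define MEMBLOCSIZE ((2<<20)*500)

int main()
{
  int* data = (int*)malloc(MEMBLOCSIZE);
  if( data == NULL )
    return 1;

  //Ask for a working set large enough to hold the whole buffer...
  SIZE_T minWs = (SIZE_T)MEMBLOCSIZE + (16 << 20);
  SIZE_T maxWs = (SIZE_T)MEMBLOCSIZE + (64 << 20);
  if( !SetProcessWorkingSetSize(GetCurrentProcess(), minWs, maxWs) )
    printf("SetProcessWorkingSetSize failed: %lu\n", GetLastError());

  //...then pin the pages so they cannot be trimmed while the test runs.
  if( !VirtualLock(data, MEMBLOCSIZE) )
    printf("VirtualLock failed: %lu\n", GetLastError());

  //... run the timed readMemory() loops exactly as in the question ...

  VirtualUnlock(data, MEMBLOCSIZE);
  free(data);
  return 0;
}

Both calls can fail (insufficient privileges or memory), which is why the sketch only reports the error and carries on.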

Another cost barrier is, as you correctly suspected, cache associativity. Different addresses at large power-of-two offsets map to the same cache sets, so with "unhealthy" strides you keep evicting the same cache lines over and over again. This is one of the two primary reasons why alignment is good but excessive over-alignment is bad (the other being loss of data locality).
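
To see why large power-of-two strides are the worst case, here is a small back-of-the-envelope sketch (assuming a hypothetical 32 kiB, 8-way L1 with 64-byte lines, i.e. 64 sets; your actual cache parameters may differ) that prints which set each address of a 4 kiB-strided walk falls into:

#include <stdio.h>

int main()
{
  //Hypothetical L1: 32 kiB, 8-way, 64-byte lines -> 32768 / (8*64) = 64 sets.
  const unsigned lineSize = 64;
  const unsigned numSets  = 64;

  const unsigned stride = 4096; //one of the strides from the benchmark
  for( unsigned i = 0; i < 10; ++i ) {
    unsigned address = i * stride;
    unsigned set = (address / lineSize) % numSets;
    printf("offset=%7u -> set %u\n", address, set);
  }
  //Every offset maps to set 0, so after 8 of them all 8 ways are used up
  //and each further access evicts a line that will soon be needed again.
  return 0;
}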

The cost barrier at 32 bytes is surprising; if anything, one would expect it at 64 bytes (the cache line size on your test architecture). Prefetching should, for the most part, eliminate this kind of stall, but the hardware prefetcher is usually only activated (unless you hint it explicitly) after the second cache line miss with a given stride.

That activation policy is perfectly acceptable for "real" programs, which either read the odd location here and there or iterate over bulk data sequentially, but it can easily produce confusing results in artificial measurements. It may also be another explanation for the cost barrier you see at 32 kiB: if prefetching is not kicking in, that is the point where you run out of L1 cache on a typical x86.
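
If you want to take the prefetcher's activation policy out of the equation, one option (a sketch using the SSE software-prefetch intrinsic; this hinted variant is not part of the original code) is to request the next read location explicitly inside the loop:

#include <xmmintrin.h>

int readMemoryWithHint( int* data, int* dataEnd, int numReads, int incrementPerRead ) {
  int accum = 0;
  int* ptr = data;

  while( numReads-- > 0 ) {
    //Hint the *next* location while consuming the current one, so the line
    //is (ideally) already on its way when we get there.
    int* next = ptr + incrementPerRead;
    if( next >= dataEnd )
      next = data;
    _mm_prefetch( (const char*)next, _MM_HINT_T0 );

    accum += *ptr;
    ptr = next;
  }
  return accum;
}

Whether this helps depends on how far ahead of the actual load the hint arrives; with a buffer this large, hinting only one step ahead is often still too late to hide the full DRAM latency.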

answered Sep 19 '22 by Damon