I was inspired by this question to write a simple program to test my machine's memory bandwidth in each cache level:
Why vectorizing the loop does not have performance improvement
My code uses memset to write to a buffer (or buffers) over and over and measures the speed. It also saves the address of every buffer to print at the end. Here's the listing:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#define SIZE_KB {8, 16, 24, 28, 32, 36, 40, 48, 64, 128, 256, 384, 512, 768, 1024, 1025, 2048, 4096, 8192, 16384, 200000}
#define TESTMEM 10000000000 // Approximate, in bytes
#define BUFFERS 1
double timer(void)
{
    struct timeval ts;
    double ans;

    gettimeofday(&ts, NULL);
    ans = ts.tv_sec + ts.tv_usec*1.0e-6;

    return ans;
}

int main(int argc, char **argv)
{
    double *x[BUFFERS];
    double t1, t2;
    int kbsizes[] = SIZE_KB;
    double bandwidth[sizeof(kbsizes)/sizeof(int)];
    int iterations[sizeof(kbsizes)/sizeof(int)];
    double *address[sizeof(kbsizes)/sizeof(int)][BUFFERS];
    int i, j, k;

    for (k = 0; k < sizeof(kbsizes)/sizeof(int); k++)
        iterations[k] = TESTMEM/(kbsizes[k]*1024);

    for (k = 0; k < sizeof(kbsizes)/sizeof(int); k++)
    {
        // Allocate
        for (j = 0; j < BUFFERS; j++)
        {
            x[j] = (double *) malloc(kbsizes[k]*1024);
            address[k][j] = x[j];
            memset(x[j], 0, kbsizes[k]*1024);
        }

        // Measure
        t1 = timer();
        for (i = 0; i < iterations[k]; i++)
        {
            for (j = 0; j < BUFFERS; j++)
                memset(x[j], 0xff, kbsizes[k]*1024);
        }
        t2 = timer();
        bandwidth[k] = (BUFFERS*kbsizes[k]*iterations[k])/1024.0/1024.0/(t2-t1);

        // Free
        for (j = 0; j < BUFFERS; j++)
            free(x[j]);
    }

    printf("TESTMEM = %ld\n", TESTMEM);
    printf("BUFFERS = %d\n", BUFFERS);
    printf("Size (kB)\tBandwidth (GB/s)\tIterations\tAddresses\n");
    for (k = 0; k < sizeof(kbsizes)/sizeof(int); k++)
    {
        printf("%7d\t\t%.2f\t\t\t%d\t\t%x", kbsizes[k], bandwidth[k], iterations[k], address[k][0]);
        for (j = 1; j < BUFFERS; j++)
            printf(", %x", address[k][j]);
        printf("\n");
    }

    return 0;
}
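For a given entry in SIZE_KB, the iteration count and the reported bandwidth follow directly from these definitions; for example, with the 8 kB buffer and BUFFERS = 1:

iterations = TESTMEM / (8 * 1024) = 10000000000 / 8192 = 1220703
bandwidth (GB/s) = (1 * 8 * 1220703) / 1024 / 1024 / (t2 - t1)

so every size writes roughly the same ~10 GB in total, just split into a different number of memset calls.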
And the results (with BUFFERS = 1):
TESTMEM = 10000000000
BUFFERS = 1
Size (kB) Bandwidth (GB/s) Iterations Addresses
8 52.79 1220703 90b010
16 56.48 610351 90b010
24 57.01 406901 90b010
28 57.13 348772 90b010
32 45.40 305175 90b010
36 38.11 271267 90b010
40 38.02 244140 90b010
48 38.12 203450 90b010
64 37.51 152587 90b010
128 36.89 76293 90b010
256 35.58 38146 d760f010
384 31.01 25431 d75ef010
512 26.79 19073 d75cf010
768 26.20 12715 d758f010
1024 26.20 9536 d754f010
1025 18.30 9527 90b010
2048 18.29 4768 d744f010
4096 18.29 2384 d724f010
8192 18.31 1192 d6e4f010
16384 18.31 596 d664f010
200000 18.32 48 cb2ff010
I can easily see the effect of the 32K L1 cache and the 256K L2 cache. What I don't understand is why performance drops so suddenly once the memset buffer exceeds 1M, when my L3 cache is supposed to be 8M. The drop is also abrupt, not tapered at all like the drops when the L1 and L2 cache sizes were exceeded.
My processor is the Intel i7 3700. The details of the L3 cache from /sys/devices/system/cpu/cpu0/cache are:
level = 3
coherency_line_size = 64
number_of_sets = 8192
physical_line_partition = 1
shared_cpu_list = 0-7
shared_cpu_map = ff
size = 8192K
type = Unified
ways_of_associativity = 16
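(These attributes are consistent with the reported size: 64 bytes/line * 16 ways * 8192 sets = 8,388,608 bytes = 8192K.)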
I thought I would try using multiple buffers: calling memset on two buffers of 1M each and seeing whether performance would drop. With BUFFERS = 2, I get:
TESTMEM = 10000000000
BUFFERS = 2
Size (kB) Bandwidth (GB/s) Iterations Addresses
8 54.15 1220703 e59010, e5b020
16 51.52 610351 e59010, e5d020
24 38.94 406901 e59010, e5f020
28 38.53 348772 e59010, e60020
32 38.31 305175 e59010, e61020
36 38.29 271267 e59010, e62020
40 38.29 244140 e59010, e63020
48 37.46 203450 e59010, e65020
64 36.93 152587 e59010, e69020
128 35.67 76293 e59010, 63769010
256 27.21 38146 63724010, 636e3010
384 26.26 25431 63704010, 636a3010
512 26.19 19073 636e4010, 63663010
768 26.20 12715 636a4010, 635e3010
1024 26.16 9536 63664010, 63563010
1025 18.29 9527 e59010, f59420
2048 18.23 4768 63564010, 63363010
4096 18.27 2384 63364010, 62f63010
8192 18.29 1192 62f64010, 62763010
16384 18.31 596 62764010, 61763010
200000 18.31 48 57414010, 4b0c3010
It appears that both 1M buffers stay in the L3 cache. But try to increase the size of either buffer ever so slightly and the performance drops.
I've been compiling with -O3. It doesn't make much difference (except possibly unrolling the loops over BUFFERS). I tried with -O0 and it's the same except for the L1 speeds. gcc version is 4.9.1.
To summarize, I have a 2-part question:

1. Why does the measured bandwidth drop so suddenly once the memset buffer exceeds 1M, when my L3 cache is supposed to be 8M?
2. Why does the combined size of all the buffers being written seem to be what matters (two 1M buffers are fine, but a single 1025K buffer is not)?
Edit: As suggested by Gabriel Southern, I ran my code with perf, using BUFFERS=1 and only one buffer size at a time. This was the full command:

perf stat -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses -r 100 ./a.out 2> perfout.txt

The -r 100 means that perf runs a.out 100 times and returns the average statistics.

The output of perf with #define SIZE_KB {1024}:
Performance counter stats for './a.out' (100 runs):
1,508,798 dTLB-loads ( +- 0.02% )
0 dTLB-load-misses # 0.00% of all dTLB cache hits
625,967,550 dTLB-stores ( +- 0.00% )
1,503 dTLB-store-misses ( +- 0.79% )
0.360471583 seconds time elapsed ( +- 0.79% )
and with #define SIZE_KB {1025}:
Performance counter stats for './a.out' (100 runs):
1,670,402 dTLB-loads ( +- 0.09% )
0 dTLB-load-misses # 0.00% of all dTLB cache hits
626,099,850 dTLB-stores ( +- 0.00% )
2,115 dTLB-store-misses ( +- 2.19% )
0.503913416 seconds time elapsed ( +- 0.06% )
So there do seem to be more dTLB store misses with the 1025K buffer. However, at this buffer size the program makes about 9,500 calls to memset, so that is roughly 2,115 / 9,527 ≈ 0.22 misses per call, still well under one miss per memset call.
Your version of memset starts using non-temporal stores when initializing a region of memory larger than 1 MB. As a result the CPU does not store these lines in its cache, even though your L3 cache is larger than 1 MB. Consequently, the performance is limited by the available memory bandwidth in the system for buffer sizes larger than 1 MB.
I tested the code you provided on several different systems and initially focused on investigating the TLB because I thought that there might be thrashing in the 2nd level TLB. However, none of the data I collected confirmed that hypothesis.
Some of the systems that I tested used Arch Linux, which has the latest version of glibc, while others used Ubuntu 10.04, which uses an older version of eglibc. I was able to reproduce the behavior described in the question with a statically linked binary on multiple different CPU architectures. The behavior I focused on was the significant difference in runtime between SIZE_KB = 1024 and SIZE_KB = 1025. The performance difference is explained by a change in the code executed for the slow and fast versions.
I used perf record and perf annotate to collect a trace of the executing assembly code and see what the hot code path was. The code is displayed below in the format: percentage of time executing instruction | address | instruction. I've copied the hot loop from the shorter version of the output, which omits most of the addresses and has a line connecting the loop back edge to the loop header.
For the version compiled on Arch Linux the hot loop was (for both 1024 and 1025 sizes):
2.35 │a0:┌─+movdqa %xmm8,(%rcx)
54.90 │ │ movdqa %xmm8,0x10(%rcx)
32.85 │ │ movdqa %xmm8,0x20(%rcx)
1.73 │ │ movdqa %xmm8,0x30(%rcx)
8.11 │ │ add $0x40,%rcx
0.03 │ │ cmp %rcx,%rdx
│ └──jne a0
For the Ubuntu 10.04 binary the hot loop when running with a size of 1024 was:
│a00:┌─+lea -0x80(%r8),%r8
0.01 │ │ cmp $0x80,%r8
5.33 │ │ movdqa %xmm0,(%rdi)
4.67 │ │ movdqa %xmm0,0x10(%rdi)
6.69 │ │ movdqa %xmm0,0x20(%rdi)
31.23 │ │ movdqa %xmm0,0x30(%rdi)
18.35 │ │ movdqa %xmm0,0x40(%rdi)
0.27 │ │ movdqa %xmm0,0x50(%rdi)
3.24 │ │ movdqa %xmm0,0x60(%rdi)
16.36 │ │ movdqa %xmm0,0x70(%rdi)
13.76 │ │ lea 0x80(%rdi),%rdi
│ └──jge a00
For the Ubuntu 10.04 version running with a buffer size of 1025 the hot loop was:
│a60:┌─+lea -0x80(%r8),%r8
0.15 │ │ cmp $0x80,%r8
1.36 │ │ movntd %xmm0,(%rdi)
0.24 │ │ movntd %xmm0,0x10(%rdi)
1.49 │ │ movntd %xmm0,0x20(%rdi)
44.89 │ │ movntd %xmm0,0x30(%rdi)
5.46 │ │ movntd %xmm0,0x40(%rdi)
0.02 │ │ movntd %xmm0,0x50(%rdi)
0.74 │ │ movntd %xmm0,0x60(%rdi)
40.14 │ │ movntd %xmm0,0x70(%rdi)
5.50 │ │ lea 0x80(%rdi),%rdi
│ └──jge a60
The key difference here is that the slower version was using movntd instructions while the faster versions used movdqa instructions. The Intel Software Developer's Manual says the following about non-temporal stores:
For WC memory type in particular, the processor never appears to read the data into the cache hierarchy. Instead, the non-temporal hint may be implemented by loading a temporary internal buffer with the equivalent of an aligned cache line without filling this data to the cache.
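For illustration only (this is not the library's code), the two store flavors correspond to the SSE2 intrinsics _mm_store_si128 and _mm_stream_si128; a minimal sketch of a fill loop using each, assuming a 16-byte-aligned buffer whose size is a multiple of 16, looks like this:

#include <stddef.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Regular stores: compiled to movdqa, so the written lines go through
   the cache hierarchy. */
void fill_cached(void *buf, int c, size_t n)
{
    __m128i v = _mm_set1_epi8((char) c);
    __m128i *p = (__m128i *) buf;
    size_t i;
    for (i = 0; i < n / 16; i++)
        _mm_store_si128(p + i, v);
}

/* Non-temporal stores: compiled to movntdq (the movntd-style stores in
   the slow loop above), so the written lines bypass the caches. */
void fill_streaming(void *buf, int c, size_t n)
{
    __m128i v = _mm_set1_epi8((char) c);
    __m128i *p = (__m128i *) buf;
    size_t i;
    for (i = 0; i < n / 16; i++)
        _mm_stream_si128(p + i, v);
    _mm_sfence();   /* make the streaming stores globally visible */
}

Timing fill_cached against fill_streaming on a buffer that fits in L3 should reproduce the same gap that the two memset versions show.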
So this seems to explain the behavior where regions larger than 1 MB written with memset don't end up in the cache. The next question is why there is a difference between the Ubuntu 10.04 system and the Arch Linux system, and why 1 MB is selected as the cutoff point. To investigate that question I looked at the glibc source code for memset. Looking at the glibc git repo at sysdeps/x86_64/memset.S, the first commit I found interesting was b2b671b677d92429a3d41bf451668f476aa267ed.
The commit description is:
Faster memset on x64
This implementation speed up memset in several ways. First is avoiding expensive computed jump. Second is using fact that arguments of memset are most of time aligned to 8 bytes.
Benchmark results on: kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_result27_04_13.tar.bz2
And the website referenced has some interesting profiling data.
The diff of the commit shows that the code for memset is simplified a lot and that the non-temporal stores are removed. This matches what the profiled code from Arch Linux shows.

Looking at the older code, I saw that the choice of whether to use non-temporal stores appeared to depend on a value described as "The largest cache size":
L(byte32sse2_pre):
mov __x86_shared_cache_size(%rip),%r9d # The largest cache size
cmp %r9,%r8
ja L(sse2_nt_move_pre)
The code for calculating this is in: sysdeps/x86_64/cacheinfo.c
Although it looks like there is code for calculating the actual shared cache size, the default value is also 1 MB:
long int __x86_64_shared_cache_size attribute_hidden = 1024 * 1024;
So I suspect that either the default value is being used, or there is some other reason that the code selects 1 MB as the cutoff point.
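Rendered as C, the check in that older assembly amounts to something like the following sketch (illustrative only; the real logic is the assembly in sysdeps/x86_64/memset.S, and the shared cache size value may be computed from CPUID at startup rather than left at the default):

#include <stddef.h>

#define SHARED_CACHE_SIZE (1024L * 1024L)   /* the 1 MB default from cacheinfo.c */

/* The byte count (%r8 above) is compared against the shared cache size
   (%r9 above); the non-temporal path is taken only for larger buffers. */
static int use_nontemporal_stores(size_t n)
{
    return n > SHARED_CACHE_SIZE;   /* cmp %r9,%r8 ; ja L(sse2_nt_move_pre) */
}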
In either case, the overall answer to your question appears to be that the version of memset on your system is using non-temporal stores when setting a region of memory larger than 1 MB.
Given Gabriel's disassembly of the generated assembly code, I think this is indeed the problem [Edit: his answer was edited, it now appears as the root cause so we're in agreement]:
Note that movnt is a streaming store, which may have (depending on the exact micro-architectural implementation) several impacts:

1. weak ordering semantics, which allows the stores to be combined in write-combining buffers rather than going through the normal cache pipeline;
2. no read-for-ownership, i.e. the target line is not fetched into the cache before being overwritten;
3. a non-temporal hint, so the written data bypasses the cache hierarchy instead of being kept in some cache level.

#1 and #2 may improve the latency and bandwidth of these operations if they're memory bound, but #3 basically forces them to be memory bound even if they could have fit in some cache level. This probably outweighs the benefits, since the memory latency/bandwidth is significantly worse to begin with.
So, your memset library implementation is probably using the wrong threshold for switching to the streaming-store version (I guess it doesn't bother checking your LLC size, but assuming everything over 1M is memory resident is quite strange). I suggest trying alternative libraries, or disabling the compiler's ability to generate streaming stores (if that's supported).
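One way to test that hypothesis on the benchmark above, without swapping libraries, is to replace the memset call in the timing loop with a hand-written fill that only ever uses ordinary stores. A minimal sketch (my own code, assuming the buffers are 8-byte aligned and their sizes are multiples of 8, which holds for the kB-sized malloc'd buffers in the question):

#include <stddef.h>
#include <stdint.h>

/* Plain 8-byte stores: unlike the library memset, this never switches
   to non-temporal stores, regardless of the buffer size. */
static void fill64(void *buf, unsigned char c, size_t n)
{
    uint64_t pattern = 0x0101010101010101ULL * c;   /* replicate the byte */
    uint64_t *p = (uint64_t *) buf;
    size_t i;
    for (i = 0; i < n / 8; i++)
        p[i] = pattern;
}

If non-temporal stores are indeed the cause, calling fill64(x[j], 0xff, kbsizes[k]*1024) instead of memset should remove the cliff at 1 MB (with GCC at -O3, -fno-tree-loop-distribute-patterns should keep the compiler from converting this loop back into a memset call).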