Understanding number of loads and stores retired in a x86 micro-benchmark

I have been using perf recently and I got some results that I cannot understand. Specifically, the numbers of retired loads and stores do not match my expectations.

I coded a very simple micro-benchmark to see if the results make sense in a very simple case:

#include <stdio.h>

#define STREAM_ARRAY_SIZE   10000000

static double   a[STREAM_ARRAY_SIZE],
    b[STREAM_ARRAY_SIZE],
    c[STREAM_ARRAY_SIZE];


int main(){

    ssize_t j;

    for (j=0; j<STREAM_ARRAY_SIZE; j++) {
        a[j] = 1.0;
        b[j] = 2.0;
        c[j] = 0.0;
    }

    return 0;
}

I compiled with gcc 4.6.3:

gcc -Wall -O benchmark.c -o benchmark

and it compiles to a very simple piece of assembly for main (obtained with objdump -d):

00000000004004b4 <main>:
  4004b4:   b8 00 00 00 00          mov    $0x0,%eax
  4004b9:   48 be 00 00 00 00 00    movabs $0x3ff0000000000000,%rsi
  4004c0:   00 f0 3f
  4004c3:   48 b9 00 00 00 00 00    movabs $0x4000000000000000,%rcx
  4004ca:   00 00 40
  4004cd:   ba 00 00 00 00          mov    $0x0,%edx
  4004d2:   48 89 34 c5 40 10 60    mov    %rsi,0x601040(,%rax,8)
  4004d9:   00
  4004da:   48 89 0c c5 40 c4 24    mov    %rcx,0x524c440(,%rax,8)
  4004e1:   05
  4004e2:   48 89 14 c5 40 78 e9    mov    %rdx,0x9e97840(,%rax,8)
  4004e9:   09
  4004ea:   48 83 c0 01             add    $0x1,%rax
  4004ee:   48 3d 80 96 98 00       cmp    $0x989680,%rax
  4004f4:   75 dc                   jne    4004d2 <main+0x1e>
  4004f6:   b8 00 00 00 00          mov    $0x0,%eax
  4004fb:   c3                      retq
  4004fc:   90                      nop
  4004fd:   90                      nop
  4004fe:   90                      nop
  4004ff:   90                      nop

The three movs should correspond to the stores to the three different arrays in memory. I would expect the number of retired stores to be very close to 30M, and virtually no loads, since I am just initializing three arrays. However, these are the results I get on a Sandy Bridge machine:

$ perf stat -e L1-dcache-loads,L1-dcache-stores ./benchmark 

 Performance counter stats for './benchmark':

        46,017,360 L1-dcache-loads                                             
        75,985,205 L1-dcache-stores

And this is for a Nehalem machine:

$ perf stat -e L1-dcache-loads,L1-dcache-stores  ./benchmark

 Performance counter stats for './benchmark':

        45,255,731 L1-dcache-loads
        60,164,676 L1-dcache-stores

How are retired loads and stores accounted for mov operations that target memory? And how come there are so many loads, even though no data is actually read from memory?

asked Nov 21 '14 by igon

1 Answer

So I got a bit curious about this and did some research, mostly to see how much more useful the perf framework is these days; the last time I used it, it crashed the kernel on a shared dev machine, and 25 other developers were quite unhappy with my experiments.

First, let's verify that I see what you see:

$ cc -O -o xx xx.c && perf stat -e L1-dcache-loads,L1-dcache-stores ./xx

 Performance counter stats for './xx':

        58,764,160 L1-dcache-loads
        81,640,635 L1-dcache-stores

Yup. Even larger numbers. So what is going on? Let's record and analyze this slightly better:

$ cc -O -o xx xx.c && perf record -e L1-dcache-loads,L1-dcache-stores ./xx
[... blah blah ...]
$ perf report --stdio
[... blah blah ...]
# Samples: 688  of event 'L1-dcache-loads'
# Event count (approx.): 56960661
#
# Overhead  Command      Shared Object    Symbol
# ........  .......  .................  ........
#
    95.80%       xx  [kernel.kallsyms]  [k] 0xffffffff811176ee
     4.20%       xx  xx                 [.] main


# Samples: 656  of event 'L1-dcache-stores'
# Event count (approx.): 80623804
#
# Overhead  Command      Shared Object    Symbol
# ........  .......  .................  ........
#
    61.72%       xx  [kernel.kallsyms]  [k] 0xffffffff811176ee
    38.28%       xx  xx                 [.] main

Aha, so the kernel is responsible for most of those loads and stores. The counters that we get count the cache accesses that both the kernel and userland do.
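
As an aside, if you only care about the userland counts, perf can restrict counting to user mode with the :u event modifier (assuming your kernel and perf version support modifiers and allow excluding kernel events):

$ perf stat -e L1-dcache-loads:u,L1-dcache-stores:u ./xx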

What's happening is that the physical pages of the program (including the data segment and bss) are not mapped or even allocated when you start the program. The kernel faults them in when you touch them for the first time (or in the future if they got paged out). We can see this with:

$ cc -O -o xx xx.c && perf stat -e faults ./xx

 Performance counter stats for './xx':

            58,696 faults

We're actually performing 58.7k page faults just during that one run. Since the page size is 4096 bytes, we get 58696*4096 = 240418816, which is roughly the 240,000,000 bytes of your three arrays; the rest is the program itself, the stack, and all kinds of junk in libc and ld.so required for running.
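
Since the arrays live in the zeroed bss and nothing is read from disk, these should all be minor faults. One way to check, assuming your perf build exposes the separate software events:

$ perf stat -e minor-faults,major-faults ./xx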

So now we can figure out the numbers. Let's look at the stores first, because they should be the easiest to figure out: 80623804*0.3828 = 30862792, so that makes sense. We expected 30 million stores and we got 30.9 million. Since the performance counters are sampled and aren't perfectly accurate, this is expected: some of the stores that the kernel did spilled over and got accounted to the program. In other runs I got slightly less than 30M counted for userland.

In the same way, userland gets 2.4M loads. I suspect those aren't actually loads performed in userland, but rather accesses the kernel does when returning from traps that for some reason get accounted to your program. Or something like that. I'm not sure about those and I don't like them, but let's see if we can remove that noise and check the theory that it's related to the page faults.

Here's an updated version of your test:

#include <stdio.h>
#include <string.h>   /* for memset */

#define STREAM_ARRAY_SIZE   10000000

static double   a[STREAM_ARRAY_SIZE],
    b[STREAM_ARRAY_SIZE],
    c[STREAM_ARRAY_SIZE];

void
setup(void)
{
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
    memset(c, 0, sizeof c);
}

void
bench(void)
{
    ssize_t j;

    for (j = 0; j < STREAM_ARRAY_SIZE; j++) {
        a[j] = 1.0;
        b[j] = 2.0;
        c[j] = 0.0;
    }
}

int
main(int argc, char **argv)
{
    setup();
    bench();
    return 0;
}

This way all the page faults happen during setup, so any counters that fire during bench should have very little kernel noise in them.

$ cc -O -o xx xx.c && perf record -e faults,L1-dcache-loads,L1-dcache-stores ./xx
[...]
$ perf report --stdio
[...]
# Samples: 468  of event 'faults'
# Event count (approx.): 58768
#
# Overhead  Command      Shared Object             Symbol
# ........  .......  .................  .................
#
    99.20%       xx  libc-2.12.so       [.] __memset_sse2
     0.69%       xx  ld-2.12.so         [.] do_lookup_x
     0.08%       xx  ld-2.12.so         [.] dl_main
     0.02%       xx  ld-2.12.so         [.] _dl_start
     0.01%       xx  ld-2.12.so         [.] _start
     0.01%       xx  [kernel.kallsyms]  [k] 0xffffffff8128f75f


# Samples: 770  of event 'L1-dcache-loads'
# Event count (approx.): 61518838
#
# Overhead  Command      Shared Object             Symbol
# ........  .......  .................  .................
#
    96.14%       xx  [kernel.kallsyms]  [k] 0xffffffff811176ee
     3.86%       xx  libc-2.12.so       [.] __memset_sse2


# Samples: 866  of event 'L1-dcache-stores'
# Event count (approx.): 98243116
#
# Overhead  Command      Shared Object             Symbol
# ........  .......  .................  .................
#
    53.69%       xx  [kernel.kallsyms]  [k] 0xffffffff811176ee
    30.62%       xx  xx                 [.] bench
    15.69%       xx  libc-2.12.so       [.] __memset_sse2

And there you have it. The page faults now happen during the call to memset (and a few during dynamic linking), the noise that was previously in main now shows up in memset, and bench itself has no loads and around 30 million stores. Just like we expected. An amusing note here: memset knows how to be efficient on this machine and did only half as many stores as your loop to fill the same amount of memory. The "sse2" in __memset_sse2 is a good hint as to how: it uses 16-byte SSE stores instead of 8-byte movs.
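
If you're curious exactly what memset did, perf can annotate the instructions the samples landed on (a quick way to see the wide SSE stores, assuming symbols are available for your libc):

$ perf record -e L1-dcache-stores ./xx
$ perf annotate __memset_sse2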

I just realized one thing that might not be clear, and I don't know where else to put it, so I'll drop it here. The performance counters count the events accurately, but as far as I know, if you want to know where those events happen, the CPU can only generate a trap once every X events. So the tools don't know exactly where each individual event happens (it would be too slow to run that way); instead they wait until the trap comes and account all X events to that instruction/function. I think, but I'm not sure, that X is at least 10000. So if the function bench just touches the stack once, and that touch happens to trigger the L1-dcache-loads trap, you'll account 10000 reads to that one read of the stack.

Also, as far as I know, TLB misses in the bench function (of which you'll get around 58593) are resolved through the L1 cache and will be accounted to it. So no matter what you do, you'll never get exactly the numbers you expect here.
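
If you want that attribution to be more predictable, you can pin the sampling period yourself instead of relying on the default (a sketch; the default period varies by perf version and event):

$ perf record -c 10000 -e L1-dcache-loads,L1-dcache-stores ./xx

On Intel chips a :p (precise) modifier on the event can also reduce the skid between an event and the instruction that gets blamed for it, if the event supports PEBS.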

answered Oct 24 '22 by Art