What constitutes a read or write to memory/cache in x86 assembly, according to valgrind?

I am looking into why some compiled Pascal code is slower than equivalent code from other languages/compilers for a simple test case. I believe it has to do with the FreePascal x86 compiler's preference for keeping values in memory (and therefore going through the data cache) even at the highest optimization level, whereas other optimizing compilers will cheat and keep everything in registers when memory is not explicitly required. When I run the program through valgrind's cachegrind tool I see a large number of data references, and I am not entirely sure which assembly instructions correspond to each read and write of the data cache. (The instruction-reference and branch counts reported by valgrind make sense to me, but the data-reference counts do not.)
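
For comparison, I have not yet tried a variant along the following lines, which keeps the state in local variables and marks the function as inline, so that the compiler is at least allowed to keep x in a register across iterations (untested sketch; the names xorshiftlocal and run are just illustrative):

program xorshiftlocal;
{$inline on}

{ same xorshift step as above, but eligible for inlining }
function xorshift(x:Uint64):Uint64; inline;
begin
    x:=x xor (x>>12);
    x:=x xor (x<<25);
    x:=x xor (x>>27);
    xorshift:=x*$2545F4914F6CDD1D;
end;

{ x and i are locals here instead of program-level globals }
procedure run;
var x:Uint64;
    i:longint;
begin
    x:=1337;
    for i:=0 to 999999999 do
        x:=xorshift(x);
    writeln(x);
end;

begin
    run;
end.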

As a point of reference, here is the simple Pascal code. I am just calculating an xorshift random number 1 billion times.

program xorshifts;

var x:Uint64;
var i:longint;

function xorshifts(x:Uint64):Uint64;
begin
    x:=x xor (x>>12);
    x:=x xor (x<<25);
    x:=x xor (x>>27);
    xorshifts:=x*$2545F4914F6CDD1D;
end;

begin
    x:=1337;
    for i:=0 to 999999999 do
        x:=xorshifts(x);
    writeln(x);
end.

Here are the relevant parts of the compiled assembly code:

.Lj5:
addl    $1,U_$P$XORSHIFTS_$$_I(%rip)                ; increment i
movq    U_$P$XORSHIFTS_$$_X(%rip),%rdi              ; put the previous result in %rdi
call    P$XORSHIFTS_$$_XORSHIFTS$QWORD$$QWORD@PLT   ; call xorshift to get new result
movq    %rax,U_$P$XORSHIFTS_$$_X(%rip)              ; save the result
cmpl    $999999999,U_$P$XORSHIFTS_$$_I(%rip)        ; loop to i=1 billion
jnge    .Lj5

P$XORSHIFTS_$$_XORSHIFTS$QWORD$$QWORD: ; this is just the xorshift function, nothing to see here
movq    %rdi,%rax
shrq    $12,%rax
xorq    %rdi,%rax
movq    %rax,%rdx
shlq    $25,%rdx
xorq    %rax,%rdx
movq    %rdx,%rax
shrq    $27,%rax
xorq    %rdx,%rax
movq    $2685821657736338717,%rdx
imulq   %rdx,%rax
ret

Here is the cachegrind output: valgrind --tool=cachegrind --branch-sim=yes ./xorshifts

==27411== I   refs:      18,000,005,765
==27411== I1  misses:               151
==27411== LLi misses:               151
==27411== I1  miss rate:           0.00%
==27411== LLi miss rate:           0.00%
==27411== 
==27411== D   refs:       6,000,003,164  (4,000,001,091 rd   + 2,000,002,073 wr)
==27411== D1  misses:               134  (           20 rd   +           114 wr)
==27411== LLd misses:               134  (           20 rd   +           114 wr)
==27411== D1  miss rate:            0.0% (          0.0%     +           0.0%  )
==27411== LLd miss rate:            0.0% (          0.0%     +           0.0%  )
==27411== 
==27411== LL refs:                  285  (          171 rd   +           114 wr)
==27411== LL misses:                285  (          171 rd   +           114 wr)
==27411== LL miss rate:             0.0% (          0.0%     +           0.0%  )
==27411== 
==27411== Branches:       1,000,000,987  (1,000,000,974 cond +            13 ind)
==27411== Mispredicts:              277  (          270 cond +             7 ind)
==27411== Mispred rate:             0.0% (          0.0%     +          53.8%   )

So, 18 billion instruction references: that makes sense, because there are 18 instructions per iteration (6 in the loop body plus 12 in the called function), and the loop executes 1 billion times.

1 billion branches. Fine.

But 6 billion data references, with 4 billion reads and 2 billion writes? I don't count 4 billion reads. Here's what I count:

.Lj5:
addl    $1,U_$P$XORSHIFTS_$$_I(%rip)                ; data write?
movq    U_$P$XORSHIFTS_$$_X(%rip),%rdi              ; data read?
call    P$XORSHIFTS_$$_XORSHIFTS$QWORD$$QWORD@PLT   ; 
movq    %rax,U_$P$XORSHIFTS_$$_X(%rip)              ; data write?
cmpl    $999999999,U_$P$XORSHIFTS_$$_I(%rip)        ; data read?
jnge    .Lj5

And FYI even the actual disassembled binary looks no different:

   4010d8:   ff ff ff
   4010db:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   4010e0:   83 05 59 d7 02 00 01    addl   $0x1,0x2d759(%rip)        # data write?
   4010e7:   48 8b 3d 42 d7 02 00    mov    0x2d742(%rip),%rdi        # data read?
   4010ee:   e8 9d ff ff ff          call   401090 
   4010f3:   48 89 05 36 d7 02 00    mov    %rax,0x2d736(%rip)        # data write?
   4010fa:   81 3d 3c d7 02 00 ff    cmpl   $0x3b9ac9ff,0x2d73c(%rip)        # data read?
   401101:   c9 9a 3b
   401104:   7c da                   jl     4010e0 <main+0x20>

That is 2 billion data reads and 2 billion data writes, 4 billion total. Where do the other 2 billion reads come from? Does every write also count as a read? Does the call instruction count as a read, a write, or both? Is it related to the symbolic addressing? Does it have something to do with the PLT? Is valgrind smoking something?

And even worse, when you use callgrind instead of cachegrind, you get a different breakdown of the 6 billion data references: valgrind --tool=callgrind --simulate-cache=yes ./xorshifts

==13672== I1  misses:               151
==13672== LLi misses:               151
==13672== I1  miss rate:           0.00%
==13672== LLi miss rate:           0.00%
==13672== 
==13672== D   refs:       6,000,003,163  (3,000,001,083 rd + 3,000,002,080 wr)
==13672== D1  misses:               134  (           20 rd +           114 wr)
==13672== LLd misses:               134  (           20 rd +           114 wr)
==13672== D1  miss rate:            0.0% (          0.0%   +           0.0%  )
==13672== LLd miss rate:            0.0% (          0.0%   +           0.0%  )
==13672== 
==13672== LL refs:                  285  (          171 rd +           114 wr)
==13672== LL misses:                285  (          171 rd +           114 wr)
==13672== LL miss rate:             0.0% (          0.0%   +           0.0%  )

What is going on here? And, more generally, since I obviously fail to understand it, what exactly constitutes a read or a write?

Asked by mDiPalma
1 Answer

  1. Source-level annotation:
    1. Compile with Valgrind debugging information:
      fpc -gv xorshifts.pas

    2. Collect some data (the backgrounded run is killed after two seconds so the simulation does not run for the full billion iterations):
      valgrind --tool=cachegrind --branch-sim=yes ./xorshifts & \
        sleep 2 && kill $!

    3. Examine line-by-line counts:
      cg_annotate --auto=yes cachegrind.out.$!

  2. Assembly-level annotation:
    1. Use --dump-instr=yes:
      valgrind --tool=callgrind --simulate-cache=yes --dump-instr=yes ./xorshifts & \
        sleep 2 && kill $!

    2. As of Valgrind version 3.18, this information can only be viewed with KCachegrind. This is confusing, because the text-style formatting in the documentation suggests that callgrind_annotate(1) could do just that.
      kcachegrind callgrind.out.$!

    3. Navigate to the “Machine Code” tab in the bar at the bottom.

As to your specific questions:

  1. You are profiling the entire executable. The “missing” counts (the stray references beyond the round billions) can possibly be attributed to the code executed before and after PASCALMAIN. Use --toggle-collect=main to count just this function (see the example invocation after this list).
  2. There is no instruction that accepts two memory operands as both source and destination, hence no single instruction is counted as both a data read and a data write.
  3. call is “essentially” shorthand for push rip; jmp func. The push is obviously a data write. Accordingly, ret is a data read access (see the per-iteration tally after this list).
  4. It is not related to the symbolic addressing.
  5. It has nothing to do with any PLT. If you compile your program without -gv and run ldd xorshifts, you will notice that it does not depend on any dynamic shared objects.
  6. Last time I checked, valgrind(1) ain’t smoking nothin’.
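
For point 1, a plausible invocation would be (assembled from the options mentioned above, not verified):

  valgrind --tool=callgrind --simulate-cache=yes --toggle-collect=main ./xorshifts

For point 3, my own tally of the loop's data accesses, assuming the convention documented in the Cachegrind manual that an instruction which modifies a memory location (such as the addl here) is counted as a single read:

  reads per iteration:  4   (addl as a modify counted as a read, the movq load, cmpl, and ret popping the return address)
  writes per iteration: 2   (call pushing the return address, and the movq store)

That is 4 billion reads and 2 billion writes over 1 billion iterations, matching cachegrind's totals; the few thousand leftover references come from the code outside the loop. Callgrind's 3-billion-plus-3-billion split appears to be the same six accesses per iteration, with the addl attributed to the write side instead.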
Answered by Kai Burghardt