 

Why would identical copies of the same C loop in the same program take significantly but consistently different times to execute?

I hope that I have reduced my question to a simple and reproducible test case. The source (which is here) contains 10 copies of an identical simple loop. Each loop is of the form:

#define COUNT (1000 * 1000 * 1000)
volatile uint64_t counter = 0;

void loopN(void) {
  for (int j = COUNT; j != 0; j--) {
    uint64_t val = counter;
    val = val + 1;
    counter = val;
  }
  return;
}

The 'volatile' qualifier on the variable is important, as it forces the value to be read from and written to memory on each iteration. Each loop is aligned to 64 bytes using '-falign-loops=64' and produces identical assembly, except for the relative offset to the global:

   400880:       48 8b 15 c1 07 20 00    mov    0x2007c1(%rip),%rdx  # 601048 <counter>
   400887:       48 83 c2 01             add    $0x1,%rdx
   40088b:       83 e8 01                sub    $0x1,%eax
   40088e:       48 89 15 b3 07 20 00    mov    %rdx,0x2007b3(%rip)  # 601048 <counter>
   400895:       75 e9                   jne    400880 <loop8+0x20>
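
The volatile is doing real work here: without it there would be nothing to measure, since GCC at -O3 folds the entire loop into a single addition. A minimal sketch of the non-volatile variant (for illustration only, not part of the actual test case):

#include <stdint.h>

#define COUNT (1000 * 1000 * 1000)

uint64_t counter = 0;   /* no volatile this time */

void loopN_novolatile(void) {
  /* With counter non-volatile, GCC -O3 typically compiles this whole
     loop down to a single add of COUNT to counter -- no loop remains. */
  for (int j = COUNT; j != 0; j--) {
    counter = counter + 1;
  }
}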

I'm running Linux 3.11 on an Intel Haswell i7-4470. I'm compiling the program with GCC 4.8.1 and the command line:

 gcc -std=gnu99 -O3 -falign-loops=64 -Wall -Wextra same-function.c -o same-function

I'm also using __attribute__((noinline)) within the source to make the assembly clearer, but this is not necessary to observe the issue. I find the fastest and slowest functions with a shell loop:

for n in 0 1 2 3 4 5 6 7 8 9; 
do echo same-function ${n}:; 
/usr/bin/time -f "%e seconds" same-function ${n}; 
/usr/bin/time -f "%e seconds" same-function ${n}; 
/usr/bin/time -f "%e seconds" same-function ${n}; 
done

It produces results which are consistent to about 1% from run to run, with the identities of the fastest and slowest functions varying depending on the exact binary layout:

same-function 0:
2.08 seconds
2.04 seconds
2.06 seconds
same-function 1:
2.12 seconds
2.12 seconds
2.12 seconds
same-function 2:
2.10 seconds
2.14 seconds
2.11 seconds
same-function 3:
2.04 seconds
2.04 seconds
2.05 seconds
same-function 4:
2.05 seconds
2.00 seconds
2.03 seconds
same-function 5:
2.07 seconds
2.07 seconds
1.98 seconds
same-function 6:
1.83 seconds
1.83 seconds
1.83 seconds
same-function 7:
1.95 seconds
1.98 seconds
1.95 seconds
same-function 8:
1.86 seconds
1.88 seconds
1.86 seconds
same-function 9:
2.04 seconds
2.04 seconds
2.02 seconds

In this case, we see that loop2() is one of the slowest to execute and loop6() is one of the fastest, with a difference of just over 10%. We reconfirm this by testing just these two cases repeatedly with a different method:

nate@haswell$ N=2; for i in {1..10}; do perf stat same-function $N 2>&1 | grep GHz; done
     7,180,104,866 cycles                    #    3.391 GHz
     7,169,930,711 cycles                    #    3.391 GHz
     7,150,190,394 cycles                    #    3.391 GHz
     7,188,959,096 cycles                    #    3.391 GHz
     7,177,272,608 cycles                    #    3.391 GHz
     7,093,246,955 cycles                    #    3.391 GHz
     7,210,636,865 cycles                    #    3.391 GHz
     7,239,838,211 cycles                    #    3.391 GHz
     7,172,716,779 cycles                    #    3.391 GHz
     7,223,252,964 cycles                    #    3.391 GHz

nate@haswell$ N=6; for i in {1..10}; do perf stat same-function $N 2>&1 | grep GHz; done
     6,234,770,361 cycles                    #    3.391 GHz
     6,199,096,296 cycles                    #    3.391 GHz
     6,213,348,126 cycles                    #    3.391 GHz
     6,217,971,263 cycles                    #    3.391 GHz
     6,224,779,686 cycles                    #    3.391 GHz
     6,194,117,897 cycles                    #    3.391 GHz
     6,225,259,274 cycles                    #    3.391 GHz
     6,244,391,509 cycles                    #    3.391 GHz
     6,189,972,381 cycles                    #    3.391 GHz
     6,205,556,306 cycles                    #    3.391 GHz
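
For completeness, the same comparison can also be made in-process, independent of /usr/bin/time and perf. This is only a sketch: time_loop is a helper introduced here (not part of the original source), and it assumes the loopN definitions from same-function.c are linked in (with that file's main removed or renamed):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

void loop2(void);   /* defined in same-function.c */
void loop6(void);

/* Time a single call with CLOCK_MONOTONIC (link with -lrt on old glibc). */
static double time_loop(void (*fn)(void)) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("loop2: %.3f s\n", time_loop(loop2));
    printf("loop6: %.3f s\n", time_loop(loop6));
    return 0;
}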

Considering this confirmed, we re-read every word in every Intel architectural manual ever written, sift through every page on the entire web that mentions the words 'computer' or 'programming', and meditate in isolation on a mountaintop for 6 years. Failing to achieve any sort of enlightenment, we come down to civilization, shave our beard, take a bath, and ask the experts of StackOverflow:

What can possibly be happening here?

Edit: With Benjamin's help (see his answer below) I've come up with an even more succinct test case. It's a standalone 20 lines of assembly. Changing from SUB to SBB causes a 15% difference in performance, even though the result remains the same and the same number of instructions are executed. Explanations? I think I'm getting closer to one.

; Minimal example, see also http://stackoverflow.com/q/26266953/3766665
; To build (Linux):
;   nasm -felf64 func.asm
;   ld func.o
; Then run:
;   perf stat -r10 ./a.out
; On Haswell and Sandy Bridge, observed runtime varies 
; ~15% depending on whether sub or sbb is used in the loop
section .text
global _start
_start:
  push qword 0h       ; put counter variable on stack
  jmp loop            ; jump to function
align 64              ; function alignment.
loop:
  mov rcx, 1000000000
align 64              ; loop alignment.
l:
  mov rax, [rsp]
  add rax, 1h
  mov [rsp], rax
; sbb rcx, 1h         ; which is faster: sbb or sub?
  sub rcx, 1h         ; switch, time it, and find out
  jne l               ; (rot13 spoiler: foo vf snfgre ol 15%)
fin:                  ; If that was too easy, explain why.
  mov eax, 60
  xor edi, edi        ; End of program. Exit with code 0
  syscall
— Nathan Kurz, asked Oct 08 '14

2 Answers

Looking at the full perf stat output, you will see that it is not the number of instructions that varies but the number of stalled cycles.

Looking at the disassembly, I have found two things:

  1. The offset to the counter variable varies between the functions. Making counter local to each function did not make the behaviour go away, though.
  2. The functions are not placed on 64-byte boundaries, so they may cover a different number of cache lines. Compiling with -falign-functions=64 did make the differences go away almost completely.

Doing the test with the aforementioned changes on my machine then yields:

for f in $(seq 7); do perf stat -e cycles -r3 ./same-function $f 2>&1; done | grep cycles
     6,070,933,420 cycles ( +- 0.11% )
     6,052,771,142 cycles ( +- 0.06% )
     6,099,676,333 cycles ( +- 0.07% )
     6,092,962,697 cycles ( +- 0.16% )
     6,151,861,993 cycles ( +- 0.69% )
     6,074,323,033 cycles ( +- 0.36% )
     6,174,434,653 cycles ( +- 0.65% )

I don't know more about the nature of the stalls you found, though.

EDIT: I have made counter a volatile local variable in each function, tested different compilations on my i7-3537U, and found `-falign-loops=64` actually to be the slowest:

$ gcc  -std=gnu99 -O3 -Wall -Wextra same-function.c -o same-function
$ gcc -falign-loops=64 -std=gnu99 -O3 -Wall -Wextra same-function.c -o same-function-l64
$ gcc -falign-functions=64 -std=gnu99 -O3 -Wall -Wextra same-function.c -o same-function-f64
$ for prog in same-function{,-l64,-f64}; do echo $prog; for f in $(seq 7); do perf stat -e cycles -r10 ./$prog $f 2>&1; done|grep cycl; done
same-function
     6,079,966,292      cycles                     ( +-  0.19% )
     7,419,053,569      cycles                     ( +-  0.07% )
     6,136,061,105      cycles                     ( +-  0.27% )
     7,282,434,896      cycles                     ( +-  0.74% )
     6,104,866,406      cycles                     ( +-  0.16% )
     7,342,985,942      cycles                     ( +-  0.52% )
     6,208,373,040      cycles                     ( +-  0.50% )
same-function-l64
     7,336,838,175      cycles                     ( +-  0.46% )
     7,358,913,923      cycles                     ( +-  0.52% )
     7,412,570,515      cycles                     ( +-  0.38% )
     7,435,048,756      cycles                     ( +-  0.10% )
     7,404,834,458      cycles                     ( +-  0.34% )
     7,291,095,582      cycles                     ( +-  0.99% )
     7,312,052,598      cycles                     ( +-  0.95% )
same-function-f64
     6,103,059,996      cycles                     ( +-  0.12% )
     6,116,601,533      cycles                     ( +-  0.29% )
     6,120,841,824      cycles                     ( +-  0.18% )
     6,114,278,098      cycles                     ( +-  0.09% )
     6,105,938,586      cycles                     ( +-  0.14% )
     6,101,672,717      cycles                     ( +-  0.19% )
     6,121,339,944      cycles                     ( +-  0.11% )

More details on align-loops vs align-functions:

$ for prog in same-function{-l64,-f64}; do sudo perf stat -d -r10 ./$prog 0; done

 Performance counter stats for './same-function-l64 0' (10 runs):

       2396.608194      task-clock:HG (msec)      #    1.001 CPUs utilized            ( +-  0.64% )
                56      context-switches:HG       #    0.024 K/sec                    ( +-  5.51% )
                 1      cpu-migrations:HG         #    0.000 K/sec                    ( +- 74.78% )
                46      page-faults:HG            #    0.019 K/sec                    ( +-  0.63% )
     7,331,450,530      cycles:HG                 #    3.059 GHz                      ( +-  0.51% ) [85.68%]
     5,332,248,218      stalled-cycles-frontend:HG #   72.73% frontend cycles idle     ( +-  0.71% ) [71.42%]
   <not supported>      stalled-cycles-backend:HG
     5,000,800,933      instructions:HG           #    0.68  insns per cycle
                                                  #    1.07  stalled cycles per insn  ( +-  0.04% ) [85.73%]
     1,000,446,303      branches:HG               #  417.443 M/sec                    ( +-  0.04% ) [85.75%]
             8,461      branch-misses:HG          #    0.00% of all branches          ( +-  6.05% ) [85.76%]
   <not supported>      L1-dcache-loads:HG
            45,593      L1-dcache-load-misses:HG  #    0.00% of all L1-dcache hits    ( +-  3.61% ) [85.77%]
             6,148      LLC-loads:HG              #    0.003 M/sec                    ( +-  8.80% ) [71.36%]
   <not supported>      LLC-load-misses:HG

       2.394456699 seconds time elapsed                                          ( +-  0.64% )


 Performance counter stats for './same-function-f64 0' (10 runs):

       1998.936383      task-clock:HG (msec)      #    1.001 CPUs utilized            ( +-  0.61% )
                60      context-switches:HG       #    0.030 K/sec                    ( +- 17.77% )
                 1      cpu-migrations:HG         #    0.001 K/sec                    ( +- 47.86% )
                46      page-faults:HG            #    0.023 K/sec                    ( +-  0.68% )
     6,107,877,836      cycles:HG                 #    3.056 GHz                      ( +-  0.34% ) [85.63%]
     4,112,602,649      stalled-cycles-frontend:HG #   67.33% frontend cycles idle     ( +-  0.52% ) [71.41%]
   <not supported>      stalled-cycles-backend:HG
     5,000,910,172      instructions:HG           #    0.82  insns per cycle
                                                  #    0.82  stalled cycles per insn  ( +-  0.01% ) [85.72%]
     1,000,423,026      branches:HG               #  500.478 M/sec                    ( +-  0.02% ) [85.77%]
            10,660      branch-misses:HG          #    0.00% of all branches          ( +- 13.23% ) [85.80%]
   <not supported>      L1-dcache-loads:HG
            47,492      L1-dcache-load-misses:HG  #    0.00% of all L1-dcache hits    ( +- 14.82% ) [85.80%]
            11,719      LLC-loads:HG              #    0.006 M/sec                    ( +- 42.44% ) [71.28%]
   <not supported>      LLC-load-misses:HG

       1.997319759 seconds time elapsed                                          ( +-  0.62% )

Both executables have a very low instructions/cycle count, probably due to the minimalistic nature of the loop and the memory pressure it is inflicting, but I have no idea why one is worse than the other.

I have also tried something along the lines of

$ for prog in same-function{-l64,-f64}; do sudo perf stat -eL1-{d,i}cache-load-misses,L1-dcache-store-misses,cs,cycles,instructions -r10 ./$prog 0; done

and

$ sudo perf record -F25000 -e'{cycles:pp,stalled-cycles-frontend}' ./same-function-l64 0
[ perf record: Woken up 28 times to write data ]
[ perf record: Captured and wrote 6.771 MB perf.data (~295841 samples) ]
$ sudo perf report --group -Sloop0 -n --show-total-period --stdio
$ sudo perf annotate --group -sloop0  --stdio

But without any success in finding the culprit. Still, I felt it might be helpful to jot it down here for you anyway...

Edit 2: Here is my patch to same-function.c; it moves the volatile counter from file scope into each loop function:

$ git diff -u -U0
diff --git a/same-function.c b/same-function.c
index f78449e..78a5772 100644
--- a/same-function.c
+++ b/same-function.c
@@ -20 +20 @@ done
-volatile uint64_t counter = 0;
+//volatile uint64_t counter = 0;
@@ -22,0 +23 @@ COMPILER_NO_INLINE void loop0(void) {
+volatile uint64_t counter = 0;
@@ -31,0 +33 @@ COMPILER_NO_INLINE void loop1(void) {
+volatile uint64_t counter = 0;
@@ -40,0 +43 @@ COMPILER_NO_INLINE void loop2(void) {
+volatile uint64_t counter = 0;
@@ -49,0 +53 @@ COMPILER_NO_INLINE void loop3(void) {
+volatile uint64_t counter = 0;
@@ -58,0 +63 @@ COMPILER_NO_INLINE void loop4(void) {
+volatile uint64_t counter = 0;
@@ -67,0 +73 @@ COMPILER_NO_INLINE void loop5(void) {
+volatile uint64_t counter = 0;
@@ -76,0 +83 @@ COMPILER_NO_INLINE void loop6(void) {
+volatile uint64_t counter = 0;
@@ -85,0 +93 @@ COMPILER_NO_INLINE void loop7(void) {
+volatile uint64_t counter = 0;
@@ -94,0 +103 @@ COMPILER_NO_INLINE void loop8(void) {
+volatile uint64_t counter = 0;
@@ -103,0 +113 @@ COMPILER_NO_INLINE void loop9(void) {
+volatile uint64_t counter = 0;
@@ -135 +145 @@ int main(int argc, char** argv) {
-}
\ No newline at end of file
+}

Edit 3: The same thing with an even more minimal example:

; Minimal example, see also http://stackoverflow.com/q/26266953/3766665 
; To build (Linux):
;   nasm -felf64 func.asm
;   ld func.o
; Then run:
;   perf stat -r10 ./a.out
; Runtime varies ~10% depending on whether the
; align directives below are active
section .text
global _start
_start:
  push qword 0h       ; put counter variable on stack
  jmp loop            ; jump to function
;align 64             ; function alignment. Try commenting this
loop:
  mov rcx, 1000000000
;align 64             ; loop alignment. Try commenting this
l:
  mov rax, [rsp]
  add rax, 1h
  mov [rsp], rax
  sub rcx, 1h
  jne l
fin:                  ; End of program. Exit with code 0
  mov eax, 60
  xor edi, edi
  syscall

Same effect here. Interesting.

Cheers, Benjamin

— bking

A few years ago, I would have told you to check for any difference in the internal state of the CPU when it arrives at each of those loops, as this was known to have a profound effect on the out-of-order execution machinery (or something like that). For example, the performance of the very same loop could change by something like 15-20% depending on what the CPU was doing just before entering it, and merely jumping to it from two different points could be enough to change the execution speed.

In your case, it's quite easy to test. All you have to do is change the order of the cases in the switch block; for example, replacing the following:

  switch (firstLetter) {
  case '0': loop0(); break;
  case '1': loop1(); break;
  case '2': loop2(); break;
  case '3': loop3(); break;
  case '4': loop4(); break;
  case '5': loop5(); break;
  case '6': loop6(); break;
  case '7': loop7(); break;
  case '8': loop8(); break;
  case '9': loop9(); break;
  default: goto die_usage;
  }

with:

  switch (firstLetter) {
  case '9': loop9(); break;
  case '8': loop8(); break;
  case '7': loop7(); break;
  case '6': loop6(); break;
  case '5': loop5(); break;
  case '4': loop4(); break;
  case '3': loop3(); break;
  case '2': loop2(); break;
  case '1': loop1(); break;
  case '0': loop0(); break;
  default: goto die_usage;
  }

or with any random order. Of course, you should check the generated assembly code to make sure that the order of the calls has not been changed by the compiler.

Also, as your loops are inside individual functions, you should make sure that these functions are themselves aligned on 64-byte boundaries.
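
With GCC this can be requested per function via the aligned attribute, instead of (or in addition to) the global -falign-functions=64 flag used in the other answer. A minimal sketch, assuming GCC or Clang, with loop0 standing in for each of the ten functions:

#include <stdint.h>

#define COUNT (1000 * 1000 * 1000)

volatile uint64_t counter = 0;

/* Ask for a 64-byte starting address for the function itself, so that
   it covers the same number of cache lines regardless of link order. */
__attribute__((noinline, aligned(64)))
void loop0(void) {
  for (int j = COUNT; j != 0; j--) {
    uint64_t val = counter;
    val = val + 1;
    counter = val;
  }
}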

— SylvainL