While trying to know how long a line of C code used to execute, I noticed this weird thing :
int main (char argc, char * argv[]) { time_t begin, end; uint64_t i; double total_time, free_time; int A = 1; int B = 1; begin = clock(); for (i = 0; i<(1<<31)-1; i++); end = clock(); free_time = (double)(end-begin)/CLOCKS_PER_SEC; printf("%f\n", free_time); begin = clock(); for (i = 0; i<(1<<31)-1; i++) { A += B%2; } end = clock(); free_time = (double)(end-begin)/CLOCKS_PER_SEC; printf("%f\n", free_time); return(0); }
Which when executed displays :
5.873425 4.826874
Why does the empty loop use more time than the second which has an instruction within ? Of course I've tried many variants but everytime, an empty loop takes more time than one with one single instruction within.
Note that I have tried swapping loops order and adding some warmup code and it didn't change my problem at all.
I'm using codeblocks as IDE with GNU gcc compiler, linux ubuntu 14.04 and have a quadcore intel i5 at 2.3GHz (I've tried running the programm on a single core, this doesn't change the result).
An empty loop is a loop which has an empty body, e.g. for(int i = 0; i < 10; ++i) {} while(cin) {} (note that the second example here also happens to be endless)
Empty Statement in a for Loop Here, the initialization block of the for loop contains nothing, hence an empty statement. For creating a loop that runs infinitely, we can use empty statements. However, if we use break statements inside the body of the loop, then the loop can terminate.
An empty loop is a loop which does not have any updation or value of iteration. For example, for(int i = 1;;) (in Java) An empty loop is infinite.
The body of the while can be empty. This is because a null statement, one that consists only of a semicolon, is syntactically valid in Java.
Assuming that your code uses a 32 bit integer int
type (which your system probably does), then nothing can be determined from your code. Instead, it exhibits undefined behavior.
foo.c:5:5: error: first parameter of 'main' (argument count) must be of type 'int' int main (char argc, char * argv[]) { ^ foo.c:13:26: warning: overflow in expression; result is 2147483647 with type 'int' [-Winteger-overflow] for (i = 0; i<(1<<31)-1; i++); ^ foo.c:19:26: warning: overflow in expression; result is 2147483647 with type 'int' [-Winteger-overflow] for (i = 0; i<(1<<31)-1; i++) { ^
Let's try to fix that:
#include <stdint.h> #include <stdio.h> #include <time.h> #include <limits.h> int main (int argc, char * argv[]) { time_t begin, end; uint64_t i; double total_time, free_time; int A = 1; int B = 1; begin = clock(); for (i = 0; i<INT_MAX; i++); end = clock(); free_time = (double)(end-begin)/CLOCKS_PER_SEC; printf("%f\n", free_time); begin = clock(); for (i = 0; i<INT_MAX; i++) { A += B%2; } end = clock(); free_time = (double)(end-begin)/CLOCKS_PER_SEC; printf("%f\n", free_time); return(0); }
Now, let's look at the assembly output of this code. Personally, I find LLVM's internal assembly very readable, so I'm going to show that. I'll produce it by running:
clang -O3 foo.c -S -emit-llvm -std=gnu99
Here's the relevant portion of the output (the main function):
define i32 @main(i32 %argc, i8** nocapture readnone %argv) #0 { %1 = tail call i64 @"\01_clock"() #3 %2 = tail call i64 @"\01_clock"() #3 %3 = sub nsw i64 %2, %1 %4 = sitofp i64 %3 to double %5 = fdiv double %4, 1.000000e+06 %6 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), double %5) #3 %7 = tail call i64 @"\01_clock"() #3 %8 = tail call i64 @"\01_clock"() #3 %9 = sub nsw i64 %8, %7 %10 = sitofp i64 %9 to double %11 = fdiv double %10, 1.000000e+06 %12 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), double %11) #3 ret i32 0 }
Note that there are no operations between the calls to clock()
for either case. So they are both compiled to the exact same thing.
The fact is that modern processors are complicated. All the instructions executed will interact with each other in complicated and interesting ways. Thanks for "that other guy" for posting the code.
Both OP and "that other guy" apparently found that the short loop takes 11 cycles, while the long one takes 9 cycles. For the long loop, 9 cycles is plenty of time even though there are lot of operations. For the short loop, there must be some stall caused by it being so short, and just adding a nop
makes the loop long enough to avoid the stall.
One thing that happens if we look at the code:
0x00000000004005af <+50>: addq $0x1,-0x20(%rbp) 0x00000000004005b4 <+55>: cmpq $0x7fffffff,-0x20(%rbp) 0x00000000004005bc <+63>: jb 0x4005af <main+50>
We read i
and write it back (addq
). We read it immediately again, and compare it (cmpq
). And then we loop. But the loop uses branch prediction. So at the time when the addq
is executed, the processor isn't really sure it is allowed to write to i
(because branch prediction could be wrong).
Then we compare to i
. The processor will try to avoid reading i
from memory, because reading it takes a long time. Instead some bit of hardware will remember that we just wrote to i
by adding to it, and instead of reading i
, the cmpq
instruction gets the data from the store instruction. Unfortunately, we are not sure at this point if the write to i
actually happened or not! So that could introduce a stall here.
The problem here is that the conditional jump, the addq
which leads to a conditional store, and the cmpq
which isn't sure where to get the data from, are all very very close together. They are unusually close together. It could be that they are so close together, the processor cannot figure out at this time whether to take i
from the store instruction or to read it from memory. And reads it from memory, which is slower because it has to wait for the store to finish. And adding just one nop
gives the processor enough time.
Usually you think that there is RAM, and there is cache. On a modern Intel processor, reading memory can read from (slowest to fastest):
So what the processor does internally in the short, slow loop:
i
from L1 cachei
i
to L1 cachei
is written to L1 cachei
from L1 cachei
with INT_MAXIn the long, fast, loop the processor does:
i
from L1 cachei
i
to L1 cachei
directly from the "store" instruction without touching L1 cachei
with INT_MAXIf you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With