While trying to know how long a line of C code used to execute, I noticed this weird thing : <pre class="prettyprint"><code>int main (char argc, char * argv[]) { time_t begin, end; uint64_t i; double total_time, free_time; int A = 1; int B = 1; begin = clock(); for (i = 0; i<(1<<31)-1; i++); end = clock(); free_time = (double)(end-begin)/CLOCKS_PER_SEC; printf("%f\n", free_time); begin = clock(); for (i = 0; i<(1<<31)-1; i++) { A += B%2; } end = clock(); free_time = (double)(end-begin)/CLOCKS_PER_SEC; printf("%f\n", free_time); return(0); } </code></pre> Which when executed displays : <pre class="prettyprint"><code>5.873425 4.826874 </code></pre> Why does the empty loop use more time than the second which has an instruction within ? Of course I've tried many variants but everytime, an empty loop takes more time than one with one single instruction within. Note that I have tried swapping loops order and adding some warmup code and it didn't change my problem at all. I'm using codeblocks as IDE with GNU gcc compiler, linux ubuntu 14.04 and have a quadcore intel i5 at 2.3GHz (I've tried running the programm on a single core, this doesn't change the result).

Assuming that your code uses a 32 bit integer <code>int</code> type (which your system probably does), then nothing can be determined from your code. Instead, it exhibits undefined behavior. <pre class="prettyprint"><code>foo.c:5:5: error: first parameter of 'main' (argument count) must be of type 'int' int main (char argc, char * argv[]) { ^ foo.c:13:26: warning: overflow in expression; result is 2147483647 with type 'int' [-Winteger-overflow] for (i = 0; i<(1<<31)-1; i++); ^ foo.c:19:26: warning: overflow in expression; result is 2147483647 with type 'int' [-Winteger-overflow] for (i = 0; i<(1<<31)-1; i++) { ^ </code></pre> Let's try to fix that: <pre class="prettyprint"><code>#include <stdint.h> #include <stdio.h> #include <time.h> #include <limits.h> int main (int argc, char * argv[]) { time_t begin, end; uint64_t i; double total_time, free_time; int A = 1; int B = 1; begin = clock(); for (i = 0; i<INT_MAX; i++); end = clock(); free_time = (double)(end-begin)/CLOCKS_PER_SEC; printf("%f\n", free_time); begin = clock(); for (i = 0; i<INT_MAX; i++) { A += B%2; } end = clock(); free_time = (double)(end-begin)/CLOCKS_PER_SEC; printf("%f\n", free_time); return(0); } </code></pre> Now, let's look at the assembly output of this code. Personally, I find LLVM's internal assembly very readable, so I'm going to show that. I'll produce it by running: <pre class="prettyprint"><code>clang -O3 foo.c -S -emit-llvm -std=gnu99 </code></pre> Here's the relevant portion of the output (the main function): <pre class="prettyprint"><code>define i32 @main(i32 %argc, i8** nocapture readnone %argv) #0 { %1 = tail call i64 @"\01_clock"() #3 %2 = tail call i64 @"\01_clock"() #3 %3 = sub nsw i64 %2, %1 %4 = sitofp i64 %3 to double %5 = fdiv double %4, 1.000000e+06 %6 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), double %5) #3 %7 = tail call i64 @"\01_clock"() #3 %8 = tail call i64 @"\01_clock"() #3 %9 = sub nsw i64 %8, %7 %10 = sitofp i64 %9 to double %11 = fdiv double %10, 1.000000e+06 %12 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), double %11) #3 ret i32 0 } </code></pre> Note that there are no operations between the calls to <code>clock()</code> for either case. So they are both compiled to the exact same thing.

The fact is that modern processors are complicated. All the instructions executed will interact with each other in complicated and interesting ways. Thanks for "that other guy" for posting the code. Both OP and "that other guy" apparently found that the short loop takes 11 cycles, while the long one takes 9 cycles. For the long loop, 9 cycles is plenty of time even though there are lot of operations. For the short loop, there must be some stall caused by it being so short, and just adding a <code>nop</code> makes the loop long enough to avoid the stall. One thing that happens if we look at the code: <pre class="prettyprint"><code>0x00000000004005af <+50>: addq $0x1,-0x20(%rbp) 0x00000000004005b4 <+55>: cmpq $0x7fffffff,-0x20(%rbp) 0x00000000004005bc <+63>: jb 0x4005af <main+50> </code></pre> We read <code>i</code> and write it back (<code>addq</code>). We read it immediately again, and compare it (<code>cmpq</code>). And then we loop. But the loop uses branch prediction. So at the time when the <code>addq</code> is executed, the processor isn't really sure it is allowed to write to <code>i</code> (because branch prediction could be wrong). Then we compare to <code>i</code>. The processor will try to avoid reading <code>i</code> from memory, because reading it takes a long time. Instead some bit of hardware will remember that we just wrote to <code>i</code> by adding to it, and instead of reading <code>i</code>, the <code>cmpq</code> instruction gets the data from the store instruction. Unfortunately, we are not sure at this point if the write to <code>i</code> actually happened or not! So that could introduce a stall here. The problem here is that the conditional jump, the <code>addq</code> which leads to a conditional store, and the <code>cmpq</code> which isn't sure where to get the data from, are all very very close together. They are unusually close together. It could be that they are so close together, the processor cannot figure out at this time whether to take <code>i</code> from the store instruction or to read it from memory. And reads it from memory, which is slower because it has to wait for the store to finish. And adding just one <code>nop</code> gives the processor enough time. Usually you think that there is RAM, and there is cache. On a modern Intel processor, reading memory can read from (slowest to fastest): <ol> <li>Memory (RAM)</li> <li>L3 cache (optional)</li> <li>L2 cache</li> <li>L1 cache</li> <li>Previous store instruction that hasn't written to L1 cache yet. </li> </ol> So what the processor does internally in the short, slow loop: <ol> <li>Read <code>i</code> from L1 cache</li> <li>Add 1 to <code>i</code> </li> <li>Write <code>i</code> to L1 cache</li> <li>Wait until <code>i</code> is written to L1 cache</li> <li>Read <code>i</code> from L1 cache</li> <li>Compare <code>i</code> with INT_MAX</li> <li>Branch to (1) if it is less. </li> </ol> In the long, fast, loop the processor does: <ol> <li>Lots of stuff</li> <li>Read <code>i</code> from L1 cache</li> <li>Add 1 to <code>i</code> </li> <li>Do a "store" instruction that will write <code>i</code> to L1 cache</li> <li>Read <code>i</code> directly from the "store" instruction without touching L1 cache</li> <li>Compare <code>i</code> with INT_MAX</li> <li>Branch to (1) if it is less. </li> </ol>

Empty loop is slower than a non-empty one in C

Q: Can we use empty for loop?

Empty Statement in a for Loop Here, the initialization block of the for loop contains nothing, hence an empty statement. For creating a loop that runs infinitely, we can use empty statements. However, if we use break statements inside the body of the loop, then the loop can terminate.

Q: What is empty loop example?

An empty loop is a loop which does not have any updation or value of iteration. For example, for(int i = 1;;) (in Java) An empty loop is infinite.

Q: Can a while loop have an empty body?

The body of the while can be empty. This is because a null statement, one that consists only of a semicolon, is syntactically valid in Java.

Tags:

performance

c

loops

While trying to know how long a line of C code used to execute, I noticed this weird thing :

int main (char argc, char * argv[]) {     time_t begin, end;     uint64_t i;     double total_time, free_time;     int A = 1;     int B = 1;      begin = clock();     for (i = 0; i<(1<<31)-1; i++);     end = clock();     free_time = (double)(end-begin)/CLOCKS_PER_SEC;     printf("%f\n", free_time);      begin = clock();     for (i = 0; i<(1<<31)-1; i++) {         A += B%2;     }     end = clock();     free_time = (double)(end-begin)/CLOCKS_PER_SEC;     printf("%f\n", free_time);      return(0); }

Which when executed displays :

5.873425 4.826874

Why does the empty loop use more time than the second which has an instruction within ? Of course I've tried many variants but everytime, an empty loop takes more time than one with one single instruction within.

Note that I have tried swapping loops order and adding some warmup code and it didn't change my problem at all.

I'm using codeblocks as IDE with GNU gcc compiler, linux ubuntu 14.04 and have a quadcore intel i5 at 2.3GHz (I've tried running the programm on a single core, this doesn't change the result).

238

asked Jul 31 '14 20:07

Celerio

2 Answers

Assuming that your code uses a 32 bit integer int type (which your system probably does), then nothing can be determined from your code. Instead, it exhibits undefined behavior.

foo.c:5:5: error: first parameter of 'main' (argument count) must be of type 'int' int main (char argc, char * argv[]) {     ^ foo.c:13:26: warning: overflow in expression; result is 2147483647 with type 'int' [-Winteger-overflow]     for (i = 0; i<(1<<31)-1; i++);                          ^ foo.c:19:26: warning: overflow in expression; result is 2147483647 with type 'int' [-Winteger-overflow]     for (i = 0; i<(1<<31)-1; i++) {                          ^

Let's try to fix that:

#include <stdint.h> #include <stdio.h> #include <time.h> #include <limits.h>  int main (int argc, char * argv[]) {     time_t begin, end;     uint64_t i;     double total_time, free_time;     int A = 1;     int B = 1;      begin = clock();     for (i = 0; i<INT_MAX; i++);     end = clock();     free_time = (double)(end-begin)/CLOCKS_PER_SEC;     printf("%f\n", free_time);      begin = clock();     for (i = 0; i<INT_MAX; i++) {         A += B%2;     }     end = clock();     free_time = (double)(end-begin)/CLOCKS_PER_SEC;     printf("%f\n", free_time);      return(0); }

Now, let's look at the assembly output of this code. Personally, I find LLVM's internal assembly very readable, so I'm going to show that. I'll produce it by running:

clang -O3 foo.c -S -emit-llvm -std=gnu99

Here's the relevant portion of the output (the main function):

define i32 @main(i32 %argc, i8** nocapture readnone %argv) #0 {   %1 = tail call i64 @"\01_clock"() #3   %2 = tail call i64 @"\01_clock"() #3   %3 = sub nsw i64 %2, %1   %4 = sitofp i64 %3 to double   %5 = fdiv double %4, 1.000000e+06   %6 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), double %5) #3   %7 = tail call i64 @"\01_clock"() #3   %8 = tail call i64 @"\01_clock"() #3   %9 = sub nsw i64 %8, %7   %10 = sitofp i64 %9 to double   %11 = fdiv double %10, 1.000000e+06   %12 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), double %11) #3   ret i32 0 }

Note that there are no operations between the calls to clock() for either case. So they are both compiled to the exact same thing.

133

answered Sep 25 '22 21:09

Bill Lynch

The fact is that modern processors are complicated. All the instructions executed will interact with each other in complicated and interesting ways. Thanks for "that other guy" for posting the code.

Both OP and "that other guy" apparently found that the short loop takes 11 cycles, while the long one takes 9 cycles. For the long loop, 9 cycles is plenty of time even though there are lot of operations. For the short loop, there must be some stall caused by it being so short, and just adding a nop makes the loop long enough to avoid the stall.

One thing that happens if we look at the code:

0x00000000004005af <+50>:    addq   $0x1,-0x20(%rbp) 0x00000000004005b4 <+55>:    cmpq   $0x7fffffff,-0x20(%rbp) 0x00000000004005bc <+63>:    jb     0x4005af <main+50>

We read i and write it back (addq). We read it immediately again, and compare it (cmpq). And then we loop. But the loop uses branch prediction. So at the time when the addq is executed, the processor isn't really sure it is allowed to write to i (because branch prediction could be wrong).

Then we compare to i. The processor will try to avoid reading i from memory, because reading it takes a long time. Instead some bit of hardware will remember that we just wrote to i by adding to it, and instead of reading i, the cmpq instruction gets the data from the store instruction. Unfortunately, we are not sure at this point if the write to i actually happened or not! So that could introduce a stall here.

The problem here is that the conditional jump, the addq which leads to a conditional store, and the cmpq which isn't sure where to get the data from, are all very very close together. They are unusually close together. It could be that they are so close together, the processor cannot figure out at this time whether to take i from the store instruction or to read it from memory. And reads it from memory, which is slower because it has to wait for the store to finish. And adding just one nop gives the processor enough time.

Usually you think that there is RAM, and there is cache. On a modern Intel processor, reading memory can read from (slowest to fastest):

Memory (RAM)
L3 cache (optional)
L2 cache
L1 cache
Previous store instruction that hasn't written to L1 cache yet.

So what the processor does internally in the short, slow loop:

Read i from L1 cache
Add 1 to i
Write i to L1 cache
Wait until i is written to L1 cache
Read i from L1 cache
Compare i with INT_MAX
Branch to (1) if it is less.

In the long, fast, loop the processor does:

Lots of stuff
Read i from L1 cache
Add 1 to i
Do a "store" instruction that will write i to L1 cache
Read i directly from the "store" instruction without touching L1 cache
Compare i with INT_MAX
Branch to (1) if it is less.

answered Sep 23 '22 21:09

gnasher729

Related questions
                            
                                What exactly is va_end for? Is it always necessary to call it?
                            
                                Hide password input on terminal
                            
                                Why is -(-2147483648) = - 2147483648 in a 32-bit machine?
                            
                                How does the C preprocessor handle circular dependencies?
                            
                                Arithmetic bit-shift on a signed integer
                            
                                What is the purpose of epoll's edge triggered option?
                            
                                How to get absolute value from double - c-language
                            
                                In a GNU C macro envSet(name), what does (void) "" name mean?
                            
                                redirect stdout/stderr to a string
                            
                                Is void a data type in C?
                            
                                Why is argc an 'int' (rather than an 'unsigned int')?
                            
                                In C, sizeof operator returns 8 bytes when passing 2.5m but 4 bytes when passing 1.25m * 2
                            
                                Error: Address already in use while binding socket with address but the port number is shown free by `netstat`
                            
                                How do header and source files in C work?
                            
                                Floating point comparison [duplicate]
                            
                                How do you get a directory listing in C?
                            
                                Using if (!!(expr)) instead of if (expr)
                            
                                How to detect LLVM and its version through #define directives?
                            
                                How to return a value from pthread threads in C?
                            
                                The difference of int8_t, int_least8_t and int_fast8_t?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With