I write a lot of vectorized loops, so 1 common idiom is <pre class="prettyprint"><code>volatile int dummy[1<<10]; for (int64_t i = 0; i + 16 <= argc; i+= 16) // process all elements with whole vector { int x = dummy[i]; } // handle remainder (hopefully with SIMD too) </code></pre> But the resulting machine code has 1 more instruction than I would like (using gcc 4.9) <pre class="prettyprint"><code>.L3: leaq -16(%rax), %rdx addq $16, %rax cmpq %rcx, %rax movl -120(%rsp,%rdx,4), %edx jbe .L3 </code></pre> If I change the code to <code>for (int64_t i = 0; i <= argc - 16; i+= 16)</code>, then the "extra" instruction is gone: <pre class="prettyprint"><code>.L2: movl -120(%rsp,%rax,4), %ecx addq $16, %rax cmpq %rdx, %rax jbe .L2 </code></pre> But why the difference? I was thinking maybe it was due to loop invariants, but too vaguely. Then I noticed in the 5 instruction case, the increment is done before the load, which would require an extra mov due to x86's destructive 2 operand instructions. So another explanation could be that it's trading instruction parallelism for 1 extra instruction. Although it seems there would hardly be any performance difference, can someone explain this mystery (preferably who knows about compiler transformations)? Ideally I would like to keep the i + 16 <= size form since that has a more intuitive meaning (the last element of the vector doesn't go out of bounds)

If <code>argc</code> were below -2147483632, and <code>i</code> was below 2147483632, the expressions <code>i+16 <= argc</code> would be required to yield an arithmetically-correct result, while the expression and <code>i<argc-16</code> would not. The need to give an arithmetically-correct result in that corner case prevents the compiler from optimizing the former expression to match the latter.

why for loop has 1 extra instruction than expected?

Q: Why does a loop run'n 1 times?

is executed n-1 times because the upper bound of the range is non-inclusive. So it will be executed when j=0, j=1, ... j = n-1. The loop itself will execute one more time (i.e. when j=n) and see that the value is no longer in range and will not execute the body.

Q: What are the 3 arguments of the for loop?

With three arguments, the sequence starts at the first value, ends before the second argument and increments or decrements by the third value.

Q: Why for loop is efficient?

It limits the scope of counter variables better than a while loop , hence it helps in better memory management.

Q: Can we use for loop two times?

In the for loop, this is happening because of the initialization part i.e., for(i=0;..). Different kinds of loops provide convenience to a programmer while writing code and one can use any kind of loop any number of times in a program.

I write a lot of vectorized loops, so 1 common idiom is

volatile int dummy[1<<10];
for (int64_t i = 0; i + 16 <= argc; i+= 16)   // process all elements with whole vector
{
  int x = dummy[i];
}
// handle remainder (hopefully with SIMD too)

But the resulting machine code has 1 more instruction than I would like (using gcc 4.9)

.L3:
        leaq    -16(%rax), %rdx
        addq    $16, %rax
        cmpq    %rcx, %rax
        movl    -120(%rsp,%rdx,4), %edx
        jbe     .L3

If I change the code to for (int64_t i = 0; i <= argc - 16; i+= 16), then the "extra" instruction is gone:

.L2:
        movl    -120(%rsp,%rax,4), %ecx
        addq    $16, %rax
        cmpq    %rdx, %rax
        jbe     .L2

But why the difference? I was thinking maybe it was due to loop invariants, but too vaguely. Then I noticed in the 5 instruction case, the increment is done before the load, which would require an extra mov due to x86's destructive 2 operand instructions. So another explanation could be that it's trading instruction parallelism for 1 extra instruction.

Although it seems there would hardly be any performance difference, can someone explain this mystery (preferably who knows about compiler transformations)?

Ideally I would like to keep the i + 16 <= size form since that has a more intuitive meaning (the last element of the vector doesn't go out of bounds)

Why does a loop run'n 1 times?

is executed n-1 times because the upper bound of the range is non-inclusive. So it will be executed when j=0, j=1, ... j = n-1. The loop itself will execute one more time (i.e. when j=n) and see that the value is no longer in range and will not execute the body.

What are the 3 arguments of the for loop?

With three arguments, the sequence starts at the first value, ends before the second argument and increments or decrements by the third value.

Why for loop is efficient?

It limits the scope of counter variables better than a while loop , hence it helps in better memory management.

Can we use for loop two times?

In the for loop, this is happening because of the initialization part i.e., for(i=0;..). Different kinds of loops provide convenience to a programmer while writing code and one can use any kind of loop any number of times in a program.

If argc were below -2147483632, and i was below 2147483632, the expressions i+16 <= argc would be required to yield an arithmetically-correct result, while the expression and i<argc-16 would not. The need to give an arithmetically-correct result in that corner case prevents the compiler from optimizing the former expression to match the latter.

why for loop has 1 extra instruction than expected?

Tags:

optimization

x86

gcc

assembly

gcc4.9

Yale Zhang

People also ask

1 Answers

supercat

Recent Activity

Donate For Us

why for loop has 1 extra instruction than expected?

Tags:

optimization

x86

gcc

assembly

gcc4.9

Yale Zhang

People also ask

1 Answers

supercat

Related questions

Recent Activity

Donate For Us