I write a lot of vectorized loops, so 1 common idiom is
volatile int dummy[1<<10];
for (int64_t i = 0; i + 16 <= argc; i+= 16) // process all elements with whole vector
{
int x = dummy[i];
}
// handle remainder (hopefully with SIMD too)
But the resulting machine code has 1 more instruction than I would like (using gcc 4.9)
.L3:
leaq -16(%rax), %rdx
addq $16, %rax
cmpq %rcx, %rax
movl -120(%rsp,%rdx,4), %edx
jbe .L3
If I change the code to for (int64_t i = 0; i <= argc - 16; i+= 16)
, then the "extra"
instruction is gone:
.L2:
movl -120(%rsp,%rax,4), %ecx
addq $16, %rax
cmpq %rdx, %rax
jbe .L2
But why the difference? I was thinking maybe it was due to loop invariants, but too vaguely. Then I noticed in the 5 instruction case, the increment is done before the load, which would require an extra mov due to x86's destructive 2 operand instructions. So another explanation could be that it's trading instruction parallelism for 1 extra instruction.
Although it seems there would hardly be any performance difference, can someone explain this mystery (preferably who knows about compiler transformations)?
Ideally I would like to keep the i + 16 <= size form since that has a more intuitive meaning (the last element of the vector doesn't go out of bounds)
is executed n-1 times because the upper bound of the range is non-inclusive. So it will be executed when j=0, j=1, ... j = n-1. The loop itself will execute one more time (i.e. when j=n) and see that the value is no longer in range and will not execute the body.
With three arguments, the sequence starts at the first value, ends before the second argument and increments or decrements by the third value.
It limits the scope of counter variables better than a while loop , hence it helps in better memory management.
In the for loop, this is happening because of the initialization part i.e., for(i=0;..). Different kinds of loops provide convenience to a programmer while writing code and one can use any kind of loop any number of times in a program.
If argc
were below -2147483632, and i
was below 2147483632, the expressions i+16 <= argc
would be required to yield an arithmetically-correct result, while the expression and i<argc-16
would not. The need to give an arithmetically-correct result in that corner case prevents the compiler from optimizing the former expression to match the latter.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With