I'm trying to optimize my code down to the last possible cycle, and I'm wondering whether the loop counter type affects performance when it is used for array indexing.
I've done some experiments with the following program that just fills an array with 0:
#include <stdint.h>
#include <stdlib.h>
#include <emmintrin.h>  // SSE2 intrinsics

int main(int argc, char **argv)
{
    // Switch between these two to change the counter width:
    // typedef int CounterType;
    typedef int64_t CounterType;

    CounterType N = atoi(argv[1]);
    uint8_t volatile dummy[N + 16];
    __m128i v = _mm_set1_epi8(0);
    for (int j = 0; j < 1000000; ++j)
    {
#pragma nounroll
        for (CounterType i = 0; i <= N; i += CounterType(16))
        {
            _mm_storeu_si128((__m128i *)&dummy[i], v);
        }
    }
    return 0;
}
Using different loop counter types (CounterType) and different compilers, I've recorded the assembly code of the inner loop and measured performance with hardware performance counters ("perf stat a.out 32768"). I'm running on a Xeon 5670.
GCC4.9, int
.L3
movups %xmm0, (%rax)
addq $16, %rax
movl %eax, %edx
subl %esi, %edx
cmpl %ecx, %edx
jle .L3
4,127,525,521 cycles # 2.934 GHz
12,304,723,292 instructions # 2.98 insns per cycle
GCC4.9, int64
.L7
movups %xmm0, (%rcx,%rax)
addq $16, %rax
cmpq %rax, %rdx
jge .L7
4,123,315,191 cycles # 2.934 GHz
8,206,745,195 instructions # 1.99 insns per cycle
ICC11, int64
..B1.6:
movdqu %xmm0, (%rdx,%rdi)
addq $16, %rdx
incq %rcx
cmpq %rbx, %rcx
jb ..B1.6 # Prob 82% #24.5
2,069,719,166 cycles # 2.934 GHz
5,130,061,268 instructions
(faster because of micro-op fusion?)
ICC11, int
..B1.6: # Preds ..B1.4 ..B1.6
movdqu %xmm0, (%rdx,%rbx) #29.38
addq $16, %rdx #24.37
cmpq %rsi, %rdx #24.34
jle ..B1.6 # Prob 82% #24.34
4,136,109,529 cycles # 2.934 GHz
8,206,897,268 instructions
ICC13, int & int64
movdqu %xmm0, (%rdi,%rax) #29.38
addq $16, %rdi #24.37
cmpq %rsi, %rdi #24.34
jle ..B1.7
4,123,963,321 cycles # 2.934 GHz
8,206,083,789 instructions # 1.99 insns per cycle
The data seems to suggest that int64 is faster. Maybe that is because it matches the pointer size, avoiding any conversions. But I'm not convinced by that conclusion. Another possibility is that, in some cases, the compiler decided to do the loop comparison before the store to achieve more parallelism, at the cost of one extra instruction (because x86's two-operand instructions are destructive). But that would be incidental, not fundamentally caused by the loop variable type.
Can someone explain this mystery (preferably someone knowledgeable about compiler transformations)?
The CUDA C Best Practices Guide also claims that signed loop counters are easier to generate code for than unsigned ones. That doesn't seem relevant here, though: there are no multiplies in the inner loop for address calculation, because that expression gets strength-reduced into an induction variable. (Apparently CUDA does prefer multiply-add to compute addresses, since MAD is one instruction just like an add and can cut register use by one.)
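To isolate the counter-width effect without the SSE store, here is a minimal scalar version of the same kind of loop (the function names are mine, not from the program above). Whether the 32-bit version really needs extra widening/compare instructions depends on the compiler's induction-variable optimization, so treat this only as a sketch:

```c
#include <stdint.h>

/* Same loop body, two counter widths. With a 32-bit counter the compiler
 * may have to keep the comparison in 32 bits and widen the index before
 * forming the 64-bit address; with a 64-bit counter the index can be used
 * directly as the pointer offset. */
void fill32(uint8_t *dst, int32_t n)
{
    for (int32_t i = 0; i < n; ++i)
        dst[i] = 0;
}

void fill64(uint8_t *dst, int64_t n)
{
    for (int64_t i = 0; i < n; ++i)
        dst[i] = 0;
}
```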
Yes, the loop variable type can affect efficiency.
Let me suggest an even better solution with GCC.
#include <stdint.h>
#include <stddef.h>
#include <xmmintrin.h>

void distance(uint8_t *dummy, size_t n, const __m128 v0)
{
    // Count i up from -n to 0: the add that updates i sets the sign
    // flag, so no separate compare is needed for the loop condition.
    // (The cast avoids the implementation-defined conversion of the
    // unsigned value -n to a signed type.)
    for (intptr_t i = -(intptr_t)n; i < 0; i += 4) {
        _mm_store_ps(&((float *)dummy)[i + n], v0);
    }
}
With GCC 4.9.2 and GCC 5.3, this produces the following main loop:
.L5:
vmovaps %xmm0, (%rdi,%rax)
addq $16, %rax
js .L5
Clang 3.6, however, still generates the cmp:
.LBB0_2: # =>This Inner Loop Header:
vmovaps %xmm0, 8(%rdi,%rax)
addq $4, %rax
cmpq $-4, %rax
jl .LBB0_2
Clang 3.7 unrolls four times and still uses a cmp, and ICC 13 unrolls twice and uses a cmp, so only GCC manages to do this without the unnecessary cmp instruction. The trick works because addq already sets the sign flag on the updated index, so the js alone serves as the loop test.