I checked out Visual Studio 2012's assembly output from the following SIMD code:
float *end = arr + sz;
float *b = other.arr;
for (float *a = arr; a < end; a += 4, b += 4)
{
__m128 ax = _mm_load_ps(a);
__m128 bx = _mm_load_ps(b);
ax = _mm_add_ps(ax, bx);
_mm_store_ps(a, ax);
}
The loop body is:
$LL11@main:
movaps xmm1, XMMWORD PTR [eax+ecx]
addps xmm1, XMMWORD PTR [ecx]
add ecx, 16 ; 00000010H
movaps XMMWORD PTR [ecx-16], xmm1
cmp ecx, edx
jb SHORT $LL11@main
Why increment ecx by 16, only to subtract 16 when storing through it on the next line?
Well, there are basically two options here.
add ecx, 16
movaps XMMWORD PTR [ecx-16], xmm1 ; stall for ecx?
cmp ecx, edx
jb loop
or
movaps XMMWORD PTR [ecx], xmm1
add ecx, 16
cmp ecx, edx ; stall for ecx?
jb loop
In option 1 you have a potential stall between add and movaps. In option 2 you have a potential stall between add and cmp. However, there is also the issue of which execution unit is used: add and cmp (= sub) use the ALU, while the [ecx-16] addressing uses the AGU (Address Generation Unit), I believe. So I suspect there might be a slight win in option 1, because ALU use is interleaved with AGU use.
ADDPS has a latency of 3 cycles, plus the memory load, so the following ADD, which is much quicker, will have completed before the next MOVAPS, which needs the result of ADDPS in xmm1, can start.
Indeed this is a bit strange.
Many compilers avoid reading a register in the instruction immediately after the one that modified it, because such code runs slower on some processors. Example:
; Code that runs fast:
add ecx, 16
mov esi, edi
cmp ecx, edx
; Code doing the same that may run slower:
mov esi, edi
add ecx, 16
cmp ecx, edx
For this reason compilers often change the order of the assembly instructions. However, in your case that is definitely not the reason. Maybe the compiler's optimization code is not written 100% correctly, and it therefore performs this kind of "optimization".