I have an Ivy Bridge CPU. The following code takes 3 cycles per iteration:
L1:
movapd xmm1, [rsi+rax] ; X[i], X[i+1]
mulpd xmm1, xmm2
movapd xmm0, [rdi+rax] ; Y[i], Y[i+1]
subpd xmm0, xmm1
movapd [rdi+rax], xmm0 ; Store result
add rax, 16
cmp rax, rcx
jl L1
The following takes 9 cycles per iteration:
L1:
movapd xmm1, [rsi+rax] ; X[i], X[i+1]
mulpd xmm1, xmm2
movapd xmm0, [rdi+rax] ; Y[i], Y[i+1]
add rax, 16
subpd xmm0, xmm1
movapd [rdi+rax], xmm0 ; Store result
cmp rax, rcx
jl L1
The only difference is the position of the add rax, 16 instruction, yet it makes the loop 3 times slower. Why is the difference so substantial?
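The main reason is that the second loop stores the result to a different location. Because rax has already been incremented when the store executes, [rdi+rax] now points 16 bytes further along, which is exactly the location the next iteration of the loop reads.
That interferes with the CPU's out-of-order execution: the load in the next iteration has to wait for the store from the current iteration, so the iterations form a loop-carried dependency chain through memory instead of overlapping.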
I would imagine that if you change the store instruction to write back to the location that was loaded, the second loop would become substantially faster again:
movapd [rdi+rax-16], xmm0 ; Store result
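To make the dependency easier to see, here is a rough scalar C model of the two loops. It is only a sketch: the function names update_in_place and update_shifted are made up for illustration, and a compiler is free to schedule this code differently, so it shows the data flow rather than reproducing the cycle counts.

```c
#include <assert.h>
#include <stddef.h>

/* Models the first loop: each pair Y[i], Y[i+1] is loaded, updated and
 * stored back to the same place, so iterations are independent. */
void update_in_place(double *y, const double *x, double a, size_t n) {
    assert(n % 2 == 0);              /* the asm processes two doubles per step */
    for (size_t i = 0; i + 1 < n; i += 2) {
        y[i]     -= a * x[i];
        y[i + 1] -= a * x[i + 1];
    }
}

/* Models the second loop: the index is bumped before the store, so the
 * result lands two elements ahead, in the slots the next iteration loads.
 * Each iteration therefore depends on the previous one through memory.
 * The bound is tightened here (an assumption of this sketch) so the
 * shifted store stays inside the array. */
void update_shifted(double *y, const double *x, double a, size_t n) {
    assert(n % 2 == 0);
    for (size_t i = 0; i + 3 < n; i += 2) {
        y[i + 2] = y[i]     - a * x[i];
        y[i + 3] = y[i + 1] - a * x[i + 1];
    }
}
```

Note that the two versions also compute different results, so the reordered loop in the question is not just slower, it no longer performs the same update.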