
Two similar assembly loops, a substantial performance difference

I have an Ivy Bridge CPU. The following code takes 3 cycles per iteration:

L1:    
    movapd xmm1, [rsi+rax] ; X[i], X[i+1]
    mulpd xmm1, xmm2
    movapd xmm0, [rdi+rax] ; Y[i], Y[i+1]
    subpd xmm0, xmm1
    movapd [rdi+rax], xmm0 ; Store result
    add rax, 16
    cmp rax, rcx
    jl L1

The following takes 9 cycles per iteration:

L1:
    movapd xmm1, [rsi+rax] ; X[i], X[i+1]
    mulpd xmm1, xmm2
    movapd xmm0, [rdi+rax] ; Y[i], Y[i+1]
    add rax, 16
    subpd xmm0, xmm1
    movapd [rdi+rax], xmm0 ; Store result
    cmp rax, rcx
    jl L1

The only difference is the position of one instruction (add rax, 16), yet it makes the program 3 times slower. Why is the difference so substantial?

Gilgamesz asked May 26 '26 12:05


1 Answer

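The main reason is that the second loop stores its result to a different location. Because add rax, 16 now executes before the store, the store to [rdi+rax] uses the already-incremented rax, and that is exactly the location the next iteration loads into xmm0. (It also means the two loops do not compute the same result.)

This interferes with the CPU's out-of-order execution: the next iteration's subpd cannot start until the current iteration's store data is available, so the iterations form a serial chain through memory instead of overlapping.

If you change the store instruction to write back to the location it loaded from, I would expect the second loop to become substantially faster again: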
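To make the dependency concrete, here is a rough scalar model of the two loops in C. It is only a sketch: the function names, the parameter k (standing in for the value held in xmm2) and the element count n are assumptions made for illustration, and each pair of statements models one 16-byte movapd/mulpd/subpd on two doubles. The second function writes two elements past index n - 1, so its y buffer is assumed to hold at least n + 2 doubles.

```c
#include <stddef.h>

/* Model of the first loop: each iteration loads Y[i], Y[i+1] and stores
 * back to the same two elements, so iterations are independent. */
void loop_in_place(double *y, const double *x, double k, size_t n) {
    for (size_t i = 0; i < n; i += 2) {
        y[i]     = y[i]     - k * x[i];
        y[i + 1] = y[i + 1] - k * x[i + 1];
    }
}

/* Model of the second loop as posted: the index is bumped between the
 * load and the store, so the result lands in Y[i+2], Y[i+3].
 * The next iteration loads exactly those elements, which creates a
 * loop-carried dependency through memory.
 * Assumption: y has room for at least n + 2 doubles. */
void loop_shifted_store(double *y, const double *x, double k, size_t n) {
    for (size_t i = 0; i < n; i += 2) {
        double r0 = y[i]     - k * x[i];
        double r1 = y[i + 1] - k * x[i + 1];
        y[i + 2] = r0;  /* read back by the next iteration */
        y[i + 3] = r1;
    }
}
```

In the first function every iteration touches its own pair of elements, so the hardware can overlap many iterations. In the second, each iteration's input is the previous iteration's output, which is the serial chain described above.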

movapd [rdi+rax-16], xmm0 ; Store result

Chris Dodd answered May 28 '26 07:05


