
Why is SIMD slower than scalar counterpart

This is yet another "SSE is slower than normal code! Why?" type of question.
I know there are a bunch of similar questions, but they don't seem to match my situation.

I am trying to implement the Miller-Rabin primality test with Montgomery modular multiplication for fast modulo operations.
I implemented it in both a scalar and a SIMD way, and it turns out that the SIMD version is around 10% slower.
In case anyone is wondering, [esp+16] (or [esp+12]) points to the modular inverse of N.

I am really puzzled by the fact that psrldq, supposedly a 1-cycle-latency / 1c-throughput / 1-uop instruction, takes longer than pmuludq with its 3-cycle latency / 0.5c throughput / 1 uop.

Below is the code and the run-time analysis from Visual Studio, run on a Ryzen 5 3600.

Any idea on how to improve the SIMD code and/or why it is slower than the scalar code is appreciated.

P.S. The run-time analysis seems to be off by one instruction for some reason.

EDIT 1: The comment in the image was wrong; a fixed version is below:

    ;------------------------------------------
    ; Calculate base^d mod x
    ;
    ; eax = 1
    ; esi = x
    ; edi = bases[eax]
    ; ebp = d
    ; while d do
    ;     if d & 1 then eax = (eax * edi) mod x
    ;     edi = (edi*edi) mod x
    ;     d >>= 1
    ; end
    ;------------------------------------------
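
In plain C, the loop in that comment corresponds roughly to the sketch below. It is only an illustration, assuming 32-bit Montgomery arithmetic with R = 2^32, that [esp+16] / [esp+12] holds the inverse of the modulus x modulo 2^32, and that the values are kept in Montgomery form as in the assembly (function and variable names here are made up):

    #include <stdint.h>

    /* one Montgomery multiplication: returns a*b / 2^32 mod x,
       where xinv is the inverse of x modulo 2^32 */
    static uint32_t montmul(uint32_t a, uint32_t b, uint32_t x, uint32_t xinv)
    {
        uint64_t t  = (uint64_t)a * b;                          /* full 64-bit product   */
        uint32_t m  = (uint32_t)t * xinv;                       /* low half * inverse    */
        uint64_t mx = (uint64_t)m * x;
        int32_t  r  = (int32_t)(t >> 32) - (int32_t)(mx >> 32); /* high(t) - high(m*x)   */
        return r < 0 ? (uint32_t)r + x : (uint32_t)r;           /* conditional +x        */
    }

    /* base^d mod x, square-and-multiply as in the comment above */
    static uint32_t modexp(uint32_t acc, uint32_t base, uint32_t d,
                           uint32_t x, uint32_t xinv)
    {
        while (d) {
            if (d & 1)
                acc = montmul(acc, base, x, xinv);  /* eax = (eax*edi) mod x */
            base = montmul(base, base, x, xinv);    /* edi = (edi*edi) mod x */
            d >>= 1;
        }
        return acc;
    }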

Scalar code:

LOOP_MODEXP:
    push eax                        ; save the result accumulator

    test ebp, 1                     ; current bit of d
    jz @F

    ; eax = (eax * edi) mod x, Montgomery multiplication
    mul edi                         ; edx:eax = eax * edi
    mov ecx, edx                    ; ecx = high half of the product
    imul eax, DWORD PTR [esp+16]    ; m = low half * inverse of x
    mul esi                         ; edx:eax = m * x
    xor ebx, ebx
    sub ecx, edx                    ; high(t) - high(m*x)
    cmovs ebx, esi                  ; went negative -> add x back
    add ecx, ebx
    mov DWORD PTR [esp], ecx        ; store new accumulator, popped into eax below
@@:
    ; edi = (edi * edi) mod x, square the base
    mov edx, edi
    mulx ecx, edx, edi              ; ecx:edx = edi * edi
    imul edx, DWORD PTR [esp+16]    ; m = low half * inverse of x
    mulx eax, ebx, esi              ; eax = high half of m * x
    xor ebx, ebx
    sub ecx, eax                    ; high(t) - high(m*x)
    cmovs ebx, esi                  ; went negative -> add x back
    add ecx, ebx
    mov edi, ecx

    pop eax                         ; reload the (possibly updated) accumulator

    shr ebp, 1                      ; next bit of d
    jnz LOOP_MODEXP

Scalar run time analysis

SIMD code

    movd xmm2, DWORD PTR [esp+12]   ; broadcast the modular inverse of x
    movd xmm3, esi                  ; broadcast x
    pshufd xmm2, xmm2, 0
    pshufd xmm3, xmm3, 0

    movd xmm1, edi                  ; broadcast the base

    pshufd xmm1, xmm1, 0
    movdqa xmm0, xmm1

    pinsrd xmm0, eax, 2             ; lane 0: running square, lane 2: result accumulator

LOOP_MODEXP:
    movdqa xmm4, xmm0               ; keep the old values for the "bit not set" case
    pmuludq xmm0, xmm1              ; lanes 0-1: square*square, lanes 2-3: acc*square
    movdqa xmm1, xmm0               ; save the 64-bit products t
    pmuludq xmm0, xmm2              ; m = low(t) * inverse of x
    pmuludq xmm0, xmm3              ; m * x
    psubd xmm1, xmm0                ; t - m*x per 32-bit lane

    psrldq xmm1, 4                  ; move the high halves into lanes 0 and 2
    pxor xmm0, xmm0
    pcmpgtd xmm0, xmm1              ; all-ones where the result went negative
    blendvps xmm0, xmm3, xmm0       ; pick x there, zero elsewhere
    paddd xmm0, xmm1                ; conditionally add x back

    movddup xmm1, xmm0              ; new square becomes the multiplier in lanes 0 and 2

    test ebp, 1
    jnz @F
    blendps xmm0, xmm4, 4           ; bit not set: restore the old accumulator in lane 2

@@:
    shr ebp, 1
    jnz LOOP_MODEXP

    pextrd eax, xmm0, 2             ; the accumulator lives in lane 2
SIMD run time analysis

asked Sep 15 '25 by quaver


1 Answer

  1. Your SIMD code wastes time on mispredictions of that test ebp, 1 / jnz branch. There's no conditional move instruction in SSE, but you can still optimize away that test + branch with a few more instructions, like this:
mov      ebx, ebp
and      ebx, 1
sub      ebx, 1
pxor     xmm5, xmm5
pinsrd   xmm5, ebx, 2
blendvps xmm0, xmm4, xmm5

instead of your

test    ebp, 1
jnz @F
blendps xmm0, xmm4, 4

The above code computes ebx = ( ebp & 1 ) ? 0 : -1;, inserts that integer into the third lane (index 2) of a zero vector, and uses that vector as the selector for the blendvps instruction.
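
In intrinsics, the same idea looks roughly like the sketch below; acc, old_acc and d stand in for xmm0, xmm4 and ebp, and the function name is made up:

    #include <immintrin.h>

    /* Branchless selection of the old accumulator: build a mask that is
       zero everywhere except lane 2, which is -1 when the exponent bit
       is 0, then let blendvps pick the old value in that lane. */
    static inline __m128i keep_or_update_acc(__m128i acc, __m128i old_acc, unsigned d)
    {
        int sel = (d & 1) ? 0 : -1;                                   /* and + sub    */
        __m128i mask = _mm_insert_epi32(_mm_setzero_si128(), sel, 2); /* pxor, pinsrd */
        return _mm_castps_si128(_mm_blendv_ps(_mm_castsi128_ps(acc),  /* blendvps     */
                                              _mm_castsi128_ps(old_acc),
                                              _mm_castsi128_ps(mask)));
    }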

  2. This instruction is not needed: pcmpgtd xmm0, xmm1. Together with the previous and the next instruction, it computes this:
xmm0 = _mm_cmplt_epi32( xmm1, _mm_setzero_si128() );
xmm0 = _mm_blendv_ps( xmm0, xmm3, xmm0 );

Here’s an equivalent:

xmm0 = _mm_blendv_ps( _mm_setzero_si128(), xmm3, xmm1 );

That comparison instruction tests the int32 lanes for xmm1 < 0, and its result is determined entirely by the sign bit of each lane. Since the _mm_blendv_ps instruction only tests the high bit of each 32-bit lane anyway, you don't really need the xmm1 < 0 comparison before it.
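
With explicit casts added (the snippets above use shorthand), the before/after looks like this; x1 and x3 stand for the contents of xmm1 and xmm3:

    #include <immintrin.h>

    /* before: compare for x1 < 0, then blend on that comparison result */
    static inline __m128i cond_mask_before(__m128i x1, __m128i x3)
    {
        __m128i lt = _mm_cmplt_epi32(x1, _mm_setzero_si128());
        return _mm_castps_si128(_mm_blendv_ps(_mm_castsi128_ps(lt),
                                              _mm_castsi128_ps(x3),
                                              _mm_castsi128_ps(lt)));
    }

    /* after: blendvps reads only the sign bit of each 32-bit lane, which
       is exactly what the comparison produced, so x1 itself can serve as
       the selector and the compare goes away */
    static inline __m128i cond_mask_after(__m128i x1, __m128i x3)
    {
        return _mm_castps_si128(_mm_blendv_ps(_mm_setzero_ps(),
                                              _mm_castsi128_ps(x3),
                                              _mm_castsi128_ps(x1)));
    }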

  3. Unless you need to support CPUs without AVX, you should use the VEX encoding of the instructions, even for code dealing with 16-byte vectors. Your SIMD code uses the legacy encoding, where most instructions take two operands and overwrite the first one with the result. Most VEX-encoded instructions take three operands and can write the result to a different register, which gets rid of a couple of redundant move instructions such as movdqa xmm4, xmm0.
answered Sep 17 '25 by Soonts