Why does this SSE2 program (integers) generate movaps (float)?

The following loops transpose one integer matrix into another. Interestingly, when I compile the code, GCC generates movaps instructions to store the results into the output matrix. Why does GCC do this?

data:

int __attribute__(( aligned(16))) t[N][M]  
  , __attribute__(( aligned(16))) c_tra[N][M];

loops:

for( i=0; i<N; i+=4){
    for(j=0; j<M; j+=4){

        row0 = _mm_load_si128((__m128i *)&t[i][j]);
        row1 = _mm_load_si128((__m128i *)&t[i+1][j]);
        row2 = _mm_load_si128((__m128i *)&t[i+2][j]);
        row3 = _mm_load_si128((__m128i *)&t[i+3][j]);

        __t0 = _mm_unpacklo_epi32(row0, row1);
        __t1 = _mm_unpacklo_epi32(row2, row3);
        __t2 = _mm_unpackhi_epi32(row0, row1);
        __t3 = _mm_unpackhi_epi32(row2, row3);

        /* combine the 64-bit halves into the four transposed rows */
        row0 = _mm_unpacklo_epi64(__t0, __t1);
        row1 = _mm_unpackhi_epi64(__t0, __t1);
        row2 = _mm_unpacklo_epi64(__t2, __t3);
        row3 = _mm_unpackhi_epi64(__t2, __t3);

        _mm_store_si128((__m128i *)&c_tra[j][i], row0);
        _mm_store_si128((__m128i *)&c_tra[j+1][i], row1);
        _mm_store_si128((__m128i *)&c_tra[j+2][i], row2);
        _mm_store_si128((__m128i *)&c_tra[j+3][i], row3);
    }
}

Assembly generated code:

.L39:
    lea rcx, [rsi+rdx]
    movdqa  xmm1, XMMWORD PTR [rdx]
    add rdx, 16
    add rax, 2048
    movdqa  xmm6, XMMWORD PTR [rcx+rdi]
    movdqa  xmm3, xmm1
    movdqa  xmm2, XMMWORD PTR [rcx+r9]
    punpckldq   xmm3, xmm6
    movdqa  xmm5, XMMWORD PTR [rcx+r10]
    movdqa  xmm4, xmm2
    punpckhdq   xmm1, xmm6
    punpckldq   xmm4, xmm5
    punpckhdq   xmm2, xmm5
    movdqa  xmm5, xmm3
    punpckhqdq  xmm3, xmm4
    punpcklqdq  xmm5, xmm4
    movdqa  xmm4, xmm1
    punpckhqdq  xmm1, xmm2
    punpcklqdq  xmm4, xmm2
    movaps  XMMWORD PTR [rax-2048], xmm5
    movaps  XMMWORD PTR [rax-1536], xmm3
    movaps  XMMWORD PTR [rax-1024], xmm4
    movaps  XMMWORD PTR [rax-512], xmm1
    cmp r11, rdx
    jne .L39

Compiled with gcc -Wall -msse4.2 -masm="intel" -O2 -c -S on a Skylake machine running Linux Mint.

With -mavx2 or -march=native, GCC generates the VEX-encoded form instead: vmovaps.

Hossein Amiri asked Feb 05 '23 02:02



1 Answer

Functionally, those instructions are the same. I don't like to copy and paste other people's statements as my own, so here are a few links that explain it:

Difference between MOVDQA and MOVAPS x86 instructions?

https://software.intel.com/en-us/forums/intel-isa-extensions/topic/279587

http://masm32.com/board/index.php?topic=1138.0

https://www.gamedev.net/blog/615/entry-2250281-demystifying-sse-move-instructions/

Short version:

So for the most part, you should try to use the move instruction that corresponds with the operations you are going to use on those registers. However, there is an additional complication. Loads and stores to and from memory execute on a separate port from the integer and floating point units; thus instructions that load from memory into a register or store from a register into memory will experience the same delay regardless of the data type you attach to the move. Thus in this case, movaps, movapd, and movdqa will have the same delay no matter what data you use. Since movaps (and movups) is encoded in binary form with one less byte than the other two, it makes sense to use it for all reg-mem moves, regardless of the data type.

So it is a GCC optimization: the store behaves the same either way, and movaps is one byte shorter.

Anty answered Feb 07 '23 16:02
