
Why are the movlps and movhps SSE instructions faster than movups for transferring misaligned data?

I found that some SSE-optimized math code uses a combination of the movlps and movhps instructions instead of a single movups instruction to transfer misaligned data. I didn't know why, so I tried it myself with the pseudocode below:

struct Vec4
{
    float f[4];
};

const size_t nSize = sizeof(Vec4) * 100;
Vec4* pA = (Vec4*)malloc( nSize );
Vec4* pB = (Vec4*)malloc( nSize );
Vec4* pR = (Vec4*)malloc( nSize );

...Some data initialization code here
...Records current time by QueryPerformanceCounter()

for( int i=0; i<100000; ++i )
{
    for( int j=0; j<100; ++j )
    {
          Vec4* a = &pA[j];
          Vec4* b = &pB[j];
          Vec4* r = &pR[j];
          __asm
          {
              mov eax, a
              mov ecx, b
              mov edx, r

              ...option 1:

              movups xmm0, [eax]
              movups xmm1, [ecx]
              mulps xmm0, xmm1
              movups [edx], xmm0

              ...option 2:

              movlps xmm0, [eax]
              movhps xmm0, [eax+8]
              movlps xmm1, [ecx]
              movhps xmm1, [ecx+8]
              mulps xmm0, xmm1
              movlps [edx], xmm0
              movhps [edx+8], xmm0
         }
    }
}
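For reference, the two options can also be written with SSE intrinsics instead of MSVC inline assembly (a portable sketch, assuming a compiler with xmmintrin.h; _mm_loadu_ps typically compiles to movups, and _mm_loadl_pi/_mm_loadh_pi to movlps/movhps):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Option 1: unaligned 128-bit loads/stores (movups)
void mul_movups(const float* a, const float* b, float* r)
{
    __m128 va = _mm_loadu_ps(a);          // movups xmm0, [a]
    __m128 vb = _mm_loadu_ps(b);          // movups xmm1, [b]
    _mm_storeu_ps(r, _mm_mul_ps(va, vb)); // mulps + movups [r], xmm0
}

// Option 2: two 64-bit halves per vector (movlps / movhps)
void mul_movlhps(const float* a, const float* b, float* r)
{
    __m128 va = _mm_loadh_pi(_mm_loadl_pi(_mm_setzero_ps(), (const __m64*)a),
                             (const __m64*)(a + 2));  // movlps, movhps
    __m128 vb = _mm_loadh_pi(_mm_loadl_pi(_mm_setzero_ps(), (const __m64*)b),
                             (const __m64*)(b + 2));
    __m128 vr = _mm_mul_ps(va, vb);
    _mm_storel_pi((__m64*)r, vr);         // movlps [r], xmm
    _mm_storeh_pi((__m64*)(r + 2), vr);   // movhps [r+8], xmm
}
```

Both functions compute the same four products; only the load/store instruction selection differs.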

...Calculates passed time

free( pA );
free( pB );
free( pR );

I ran the code many times and averaged the timings.

For the movups version, the result is about 50 ms.

For the movlps/movhps version, the result is about 46 ms.

I also tried a data-aligned version, with __declspec(align(16)) on the structure and allocation by _aligned_malloc(); the result is about 34 ms.

Why is the combination of movlps and movhps faster? Does that mean we should prefer movlps and movhps over movups?

SeaStar asked Nov 23 '12

1 Answer

Athlons of this generation (K8) have only 64-bit-wide SSE execution units, so every 128-bit SSE instruction must be split into two 64-bit macro-operations, which incurs overhead for some instructions. The movlps/movhps pair is already expressed as 64-bit operations, so it sidesteps that splitting penalty.

On this type of processor you will generally find no speedup using SSE compared to equivalent MMX code.

Quoting Agner Fog in The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers:

12.9 64 bit versus 128 bit instructions

It is a big advantage to use 128-bit instructions on K10, but not on K8 because each 128-bit instruction is split into two 64-bit macro-operations on the K8.

128 bit memory write instructions are handled as two 64-bit macro-operations on K10, while 128 bit memory read is done with a single macro-operation on K10 (2 on K8).

128 bit memory read instructions use only the FMISC unit on K8, but all three units on K10. It is therefore not advantageous to use XMM registers just for moving blocks of data from one memory position to another on the K8, but it is advantageous on K10.
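This also explains the asker's fastest result (about 34 ms): with 16-byte-aligned storage, a single aligned load or store suffices and the unaligned-access machinery is avoided entirely. A minimal sketch using standard C++11 alignas in place of the MSVC-specific __declspec(align(16)) / _aligned_malloc() pair:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

struct alignas(16) Vec4  // portable equivalent of __declspec(align(16))
{
    float f[4];
};

// With 16-byte alignment guaranteed, the aligned forms (movaps) are safe.
void mul_aligned(const Vec4& a, const Vec4& b, Vec4& r)
{
    __m128 va = _mm_load_ps(a.f);          // movaps xmm0, [a]
    __m128 vb = _mm_load_ps(b.f);          // movaps xmm1, [b]
    _mm_store_ps(r.f, _mm_mul_ps(va, vb)); // mulps + movaps [r], xmm0
}
```

Stack objects and arrays of this Vec4 are automatically 16-byte aligned; heap allocations still need an aligned allocator (e.g. C++17 aligned operator new).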

Gunther Piez answered Sep 21 '22