Using XMM0 register and memory fetches (C++ code) is twice as fast as ASM only using XMM registers - Why?

I'm trying to implement some inline assembler (in Visual Studio 2012 C++ code) to take advantage of SSE. I want to add 7 numbers 1e9 times, so I loaded them from RAM into the CPU's xmm0 through xmm6 registers and accumulate them with inline assembly. Here are both versions.

the C++ code:

for(int i=0;i<count;i++)
        resVal+=val1+val2+val3+val4+val5+val6+val7;

my ASM code:

int count=1000000000;

    double resVal=0.0;
    //placing the values into registers
    __asm{
        movsd xmm0,val1;placing val1 in xmm0 register
        movsd xmm1,val2  
        movsd xmm2,val3  
        movsd xmm3,val4  
        movsd xmm4,val5  
        movsd xmm5,val6  
        movsd xmm6,val7  
        pxor xmm7,xmm7;//set xmm7 (the accumulator) to zero
         }

    for(int i=0;i<count;i++)
    {
        __asm
        {
            addsd xmm7,xmm0;//+=val1
            addsd xmm7,xmm1;//+=val2
            addsd xmm7,xmm2;
            addsd xmm7,xmm3;
            addsd xmm7,xmm4;
            addsd xmm7,xmm5;
            addsd xmm7,xmm6;//+=val7
        }

    }

    __asm
        {
            movsd resVal,xmm7;//placing xmm7 into resVal
        }
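For completeness, here is roughly how the fragments above fit together as a self-contained program (the actual values of val1..val7 are arbitrary placeholders here, since only fragments are shown):

    #include <cstdio>

    int main()
    {
        // Arbitrary placeholder values; any seven doubles will do.
        double val1 = 1.1, val2 = 2.2, val3 = 3.3, val4 = 4.4,
               val5 = 5.5, val6 = 6.6, val7 = 7.7;
        double resVal = 0.0;
        int count = 1000000000;

        // The plain C++ version; the compiler turns this into the
        // memory-operand addsd sequence shown in the disassembly below.
        for (int i = 0; i < count; i++)
            resVal += val1 + val2 + val3 + val4 + val5 + val6 + val7;

        printf("%f\n", resVal);
        return 0;
    }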

And this is the disassembled code the C++ compiler generates for 'resVal+=val1+val2+val3+val4+val5+val6+val7':

movsd       xmm0,mmword ptr [val1]  
addsd       xmm0,mmword ptr [val2]  
addsd       xmm0,mmword ptr [val3]  
addsd       xmm0,mmword ptr [val4]  
addsd       xmm0,mmword ptr [val5]  
addsd       xmm0,mmword ptr [val6]  
addsd       xmm0,mmword ptr [val7]  
addsd       xmm0,mmword ptr [resVal]  
movsd       mmword ptr [resVal],xmm0  

As you can see, the compiler uses just a single register, xmm0, and fetches the other values from RAM each time.

Both versions (my ASM code and the C++ code) produce the same result, but the C++ code takes about half the time of my ASM code to execute!

I have read that working with CPU registers is much faster than working with memory, so I did not expect this ratio. Why does the ASM version perform worse than the C++ code?

asked Mar 11 '13 by epsi1on

1 Answer

  • Once the data is in the cache (which will be the case after the first iteration, if it isn't there already), it makes little difference whether you use memory or a register.
  • A floating-point add takes more than a single cycle in the first place.
  • The final store to resVal "unties" the xmm0 register, allowing it to be freely "renamed", which lets more of the loop iterations run in parallel.

This is a typical case of "unless you are absolutely sure, leave writing code to the compiler".

The last bullet above explains why the compiler's code is faster than code in which every step of the loop depends on a previously calculated result.
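To see the same effect at the source level, compare one long accumulator chain with two independent chains. This is a sketch of my own to illustrate the idea, not code from the question; it assumes the same val1..val7 and count as above:

    // Single accumulator: the add into acc at the end of each iteration
    // depends on the previous iteration's result, so iterations serialize
    // on that one chain.
    double sum_one_chain(double v1, double v2, double v3, double v4,
                         double v5, double v6, double v7, int count)
    {
        double acc = 0.0;
        for (int i = 0; i < count; i++)
            acc += v1 + v2 + v3 + v4 + v5 + v6 + v7;
        return acc;
    }

    // Two independent accumulators: the CPU can work on both chains at
    // once, much like the "mingled" instruction stream shown further down.
    // (Assumes count is even; a real version would handle the remainder.)
    double sum_two_chains(double v1, double v2, double v3, double v4,
                          double v5, double v6, double v7, int count)
    {
        double acc0 = 0.0, acc1 = 0.0;
        for (int i = 0; i < count; i += 2)
        {
            acc0 += v1 + v2 + v3 + v4 + v5 + v6 + v7;
            acc1 += v1 + v2 + v3 + v4 + v5 + v6 + v7;
        }
        return acc0 + acc1;
    }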

In the compiler-generated code, two consecutive iterations of the loop can do the equivalent of:

movsd       xmm0,mmword ptr [val1]  
addsd       xmm0,mmword ptr [val2]  
addsd       xmm0,mmword ptr [val3]  
addsd       xmm0,mmword ptr [val4]  
addsd       xmm0,mmword ptr [val5]  
addsd       xmm0,mmword ptr [val6]  
addsd       xmm0,mmword ptr [val7]  
addsd       xmm0,mmword ptr [resVal]  
movsd       mmword ptr [resVal],xmm0  

movsd       xmm1,mmword ptr [val1]  
addsd       xmm1,mmword ptr [val2]  
addsd       xmm1,mmword ptr [val3]  
addsd       xmm1,mmword ptr [val4]  
addsd       xmm1,mmword ptr [val5]  
addsd       xmm1,mmword ptr [val6]  
addsd       xmm1,mmword ptr [val7]  
addsd       xmm1,mmword ptr [resVal]  
movsd       mmword ptr [resVal],xmm1

Now, as you can see, we could "mingle" these two "threads":

movsd       xmm0,mmword ptr [val1]  
movsd       xmm1,mmword ptr [val1]  
addsd       xmm0,mmword ptr [val2]  
addsd       xmm1,mmword ptr [val2]  
addsd       xmm0,mmword ptr [val3]  
addsd       xmm1,mmword ptr [val3]  
addsd       xmm0,mmword ptr [val4]  
addsd       xmm1,mmword ptr [val4]  
addsd       xmm0,mmword ptr [val5]  
addsd       xmm1,mmword ptr [val5]  
addsd       xmm0,mmword ptr [val6]  
addsd       xmm1,mmword ptr [val6]  
addsd       xmm0,mmword ptr [val7]  
addsd       xmm1,mmword ptr [val7]  
addsd       xmm0,mmword ptr [resVal]  
movsd       mmword ptr [resVal],xmm0  
// Here we have to wait for resVal to be updated!
addsd       xmm1,mmword ptr [resVal]  
movsd       mmword ptr [resVal],xmm1

I'm not suggesting there is quite that much out-of-order execution, but I can certainly see how this loop executes faster than your loop. You could probably achieve the same thing in your assembler code if you had a spare register [in x86_64 you do have another 8 XMM registers, although you can't use inline assembler in x86_64 with Visual Studio...]
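If you want that kind of explicit control on x64, where Visual Studio has no inline assembler, SSE intrinsics from <emmintrin.h> are one alternative. The following is only a sketch of the two-accumulator idea, not code from the question:

    #include <emmintrin.h>

    double sum_with_intrinsics(double val1, double val2, double val3,
                               double val4, double val5, double val6,
                               double val7, int count)
    {
        __m128d v1 = _mm_set_sd(val1), v2 = _mm_set_sd(val2),
                v3 = _mm_set_sd(val3), v4 = _mm_set_sd(val4),
                v5 = _mm_set_sd(val5), v6 = _mm_set_sd(val6),
                v7 = _mm_set_sd(val7);

        // Two independent accumulators, so the two dependency chains can
        // run in parallel (assumes count is even).
        __m128d acc0 = _mm_setzero_pd();
        __m128d acc1 = _mm_setzero_pd();

        for (int i = 0; i < count; i += 2)
        {
            acc0 = _mm_add_sd(acc0, v1);  acc1 = _mm_add_sd(acc1, v1);
            acc0 = _mm_add_sd(acc0, v2);  acc1 = _mm_add_sd(acc1, v2);
            acc0 = _mm_add_sd(acc0, v3);  acc1 = _mm_add_sd(acc1, v3);
            acc0 = _mm_add_sd(acc0, v4);  acc1 = _mm_add_sd(acc1, v4);
            acc0 = _mm_add_sd(acc0, v5);  acc1 = _mm_add_sd(acc1, v5);
            acc0 = _mm_add_sd(acc0, v6);  acc1 = _mm_add_sd(acc1, v6);
            acc0 = _mm_add_sd(acc0, v7);  acc1 = _mm_add_sd(acc1, v7);
        }
        return _mm_cvtsd_f64(_mm_add_sd(acc0, acc1));
    }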

(Note that register renaming is different from my "threaded" loop, which uses two different registers - but the effect is roughly the same: the loop can continue past the resVal update without having to wait for the result.)

answered Sep 19 '22 by Mats Petersson