Load constant floats into SSE registers

Tags: assembly, sse

I'm trying to figure out an efficient way to load compile-time constant floats into SSE(2/3) registers. I tried simple code like this:

const __m128 x = { 1.0f, 2.0f, 3.0f, 4.0f }; 

but that generates 4 movss instructions from memory!

movss       xmm0,dword ptr [__real@3f800000 (14048E534h)] 
movss       xmm1,dword ptr [__real@40000000 (14048E530h)] 
movaps      xmm6,xmm12 
shufps      xmm6,xmm12,0C6h 
movss       dword ptr [rsp],xmm0 
movss       xmm0,dword ptr [__real@40400000 (14048E52Ch)] 
movss       dword ptr [rsp+4],xmm1 
movss       xmm1,dword ptr [__real@40a00000 (14048E528h)] 

which load the scalars in and out of memory... (?!?!)

Doing this though..

float Align(16) myfloat4[4] = { 1.0f, 2.0f, 3.0f, 4.0f, }; // out in global scope

generates:

movaps      xmm5,xmmword ptr [::myarray4 (140512050h)]

Ideally, when I have constants, there would be a way to avoid touching memory at all and just use immediate-style instructions (i.e. with the constants encoded in the instruction itself).

Thanks

coderdave asked Feb 15 '11 18:02


2 Answers

If you want to force it to a single load, you could try (gcc):

__attribute__((aligned(16))) float vec[4] = { 1.0f, 1.1f, 1.2f, 1.3f };
__m128 v = _mm_load_ps(vec); // no "&" needed: the array name already decays to a pointer

If you have Visual C++, use __declspec(align(16)) to request the same alignment.

On my system, compiling this with gcc -m32 -msse -O2 produces the following assembly (gcc / AT&T syntax). With no optimization at all the code is more cluttered, but it still ends in a single movaps:

    andl    $-16, %esp
    subl    $16, %esp
    movl    $0x3f800000, (%esp)
    movl    $0x3f8ccccd, 4(%esp)
    movl    $0x3f99999a, 8(%esp)
    movl    $0x3fa66666, 12(%esp)
    movaps  (%esp), %xmm0

Note that it aligns the stack pointer before allocating stack space and storing the constants there. Leaving the __attribute__((aligned)) out may, depending on your compiler, produce incorrect code that skips this alignment, so beware, and check the disassembly.

Additionally:
Since you asked how to put the constants into the code, simply try the above with a static qualifier on the float array. That produces the following assembly:

    movaps  vec.7330, %xmm0
    ...
vec.7330:
    .long   1065353216
    .long   1066192077
    .long   1067030938
    .long   1067869798
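
For reference, here is a minimal sketch of the static variant that should build with both gcc and Visual C++. The names ALIGN16, k_vec and load_k_vec are made up for illustration; only _mm_load_ps and the two alignment spellings come from the text above. Whether you actually get a single movaps still depends on your compiler and flags, so check the disassembly.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps */

/* ALIGN16 is a hypothetical helper macro, not part of any standard header. */
#if defined(_MSC_VER)
#  define ALIGN16 __declspec(align(16))
#else
#  define ALIGN16 __attribute__((aligned(16)))
#endif

/* static const: the four floats are emitted once as data instead of being
   rebuilt on the stack every time the function runs. */
static const ALIGN16 float k_vec[4] = { 1.0f, 1.1f, 1.2f, 1.3f };

/* load_k_vec is a hypothetical name used only for this sketch. */
static __m128 load_k_vec(void)
{
    /* _mm_load_ps requires a 16-byte aligned address, hence ALIGN16. */
    return _mm_load_ps(k_vec);
}
```

Calling load_k_vec() returns the vector with 1.0f in the lowest lane and 1.3f in the highest.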
FrankH. answered Sep 20 '22 20:09


First off, what optimization level are you compiling at? It's not uncommon to see that sort of codegen at -O0 or -O1, but I would be quite surprised to see it with -O2 or higher in most compilers.
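
A small test case along these lines makes it easy to compare optimization levels. This is only a sketch: make_1234 is a hypothetical name, and _mm_setr_ps is the standard SSE intrinsic that takes the elements in memory order. With optimization on I would expect a single load from a constant pool, while an unoptimized build may assemble the vector element by element, but that is an expectation to verify in the disassembly, not a guarantee.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_setr_ps */

/* make_1234 is a hypothetical name used only for this sketch. */
static __m128 make_1234(void)
{
    /* _mm_setr_ps takes its arguments lowest lane first,
       so lane 0 = 1.0f and lane 3 = 4.0f. */
    return _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
}
```

Compile it once per optimization level and compare the generated code for make_1234.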

Second, there are no immediate loads in SSE. You can load an immediate into a GPR and then move that value to an SSE register, but you cannot conjure other values without an actual load (ignoring certain special values like 0 or (int)-1, which can be produced via logical operations).
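
As a rough sketch of those no-load options, using SSE2 intrinsics: all function names below are made up for illustration. The shift sequence in one_ps is a commonly used trick that goes slightly beyond the special values named above. An optimizing compiler is free to fold any of these back into an ordinary constant load, so treat the instruction names in the comments as what the intrinsics usually map to, not a promise.

```c
#include <emmintrin.h>  /* SSE2: integer ops on XMM registers */

/* All function names here are hypothetical, chosen for this sketch. */

static __m128 zero_ps(void)
{
    return _mm_setzero_ps();            /* usually xorps reg,reg */
}

static __m128i all_ones_epi32(void)
{
    __m128i x = _mm_setzero_si128();
    return _mm_cmpeq_epi32(x, x);       /* pcmpeqd: x == x sets every bit */
}

static __m128 one_ps(void)
{
    /* 0xFFFFFFFF << 25 = 0xFE000000, then a logical >> 2 gives
       0x3F800000, which is the bit pattern of 1.0f. */
    __m128i x = all_ones_epi32();
    x = _mm_slli_epi32(x, 25);
    x = _mm_srli_epi32(x, 2);
    return _mm_castsi128_ps(x);
}

static __m128 splat_from_gpr(int bits)
{
    /* Immediate in a GPR, then movd into the low lane of an XMM register. */
    __m128i x = _mm_cvtsi32_si128(bits);
    x = _mm_shuffle_epi32(x, 0);        /* pshufd: copy lane 0 to all lanes */
    return _mm_castsi128_ps(x);
}
```

For example, splat_from_gpr(0x40000000) yields 2.0f in all four lanes. Note that this only gets one distinct value into the register per GPR move; four different floats would need more moves and shuffles, which is rarely better than one aligned load.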

Finally, if the bad code is being generated with optimizations turned on and in a performance-critical location, please file a bug against your compiler.

Stephen Canon answered Sep 21 '22 20:09