I'm trying to figure out an efficient way to load compile-time constant floats into SSE(2/3) registers. I've tried simple code like this:
const __m128 x = { 1.0f, 2.0f, 3.0f, 4.0f };
but that generates 4 movss instructions from memory!
movss xmm0,dword ptr [__real@3f800000 (14048E534h)]
movss xmm1,dword ptr [__real@40000000 (14048E530h)]
movaps xmm6,xmm12
shufps xmm6,xmm12,0C6h
movss dword ptr [rsp],xmm0
movss xmm0,dword ptr [__real@40400000 (14048E52Ch)]
movss dword ptr [rsp+4],xmm1
movss xmm1,dword ptr [__real@40a00000 (14048E528h)]
which moves the scalars in and out of memory... (?!?!)
Doing this instead, though,
float Align(16) myarray4[4] = { 1.0f, 2.0f, 3.0f, 4.0f, }; // out in global scope
generates a single aligned load:
movaps xmm5,xmmword ptr [::myarray4 (140512050h)]
Ideally, when the values are constants, there would be a way to avoid touching memory at all and load them with immediate-style instructions (i.e. with the constants encoded in the instruction itself).
Thanks
If you want to force it to a single load, you could try (gcc):
__attribute__((aligned(16))) float vec[4] = { 1.0f, 1.1f, 1.2f, 1.3f };
__m128 v = _mm_load_ps(vec); // no "&" needed: vec already decays to a pointer to its first element
If you have Visual C++, use __declspec(align(16)) to request the proper constraint.
On my system, this (compiled with gcc -m32 -msse -O2; no optimization at all clutters the code but still retains the single movaps in the end) creates the following assembly code (gcc / AT&T syntax):
andl $-16, %esp
subl $16, %esp
movl $0x3f800000, (%esp)
movl $0x3f8ccccd, 4(%esp)
movl $0x3f99999a, 8(%esp)
movl $0x3fa66666, 12(%esp)
movaps (%esp), %xmm0
Note that it aligns the stack pointer before allocating stack space and putting the constants in there. Leaving the __attribute__((aligned)) out may, depending on your compiler, create incorrect code that doesn't do this, so beware, and check the disassembly.
Additionally:
Since you've been asking how to put constants into the code, simply try the above with a static qualifier for the float array. That creates the following assembly:
movaps vec.7330, %xmm0
...
vec.7330:
.long 1065353216
.long 1066192077
.long 1067030938
.long 1067869798
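A minimal sketch of the static variant, for anyone who wants to compare the two routes on their own compiler. ALIGN16 and the function names are hypothetical helpers, not part of any library, and the constants are the ones used above. The second function uses the standard _mm_setr_ps intrinsic; with optimizations enabled, compilers typically fold constant arguments into a single aligned load from a read-only constant, but that is worth confirming in the disassembly rather than assuming.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_setr_ps, _mm_cmpeq_ps, _mm_movemask_ps */

/* ALIGN16 is a hypothetical convenience macro, not a standard name. */
#if defined(_MSC_VER)
#  define ALIGN16 __declspec(align(16))
#else
#  define ALIGN16 __attribute__((aligned(16)))
#endif

/* Route 1: a static, 16-byte aligned array loaded with one aligned load. */
static __m128 constants_from_static_array(void)
{
    ALIGN16 static const float vec[4] = { 1.0f, 1.1f, 1.2f, 1.3f };
    return _mm_load_ps(vec);  /* _mm_load_ps requires 16-byte alignment */
}

/* Route 2: let the compiler build the vector from constant arguments.
   _mm_setr_ps places its first argument in the lowest lane, so the lane
   order matches the array above. */
static __m128 constants_from_setr(void)
{
    return _mm_setr_ps(1.0f, 1.1f, 1.2f, 1.3f);
}

/* Returns 1 if all four lanes of a and b compare equal, otherwise 0. */
static int lanes_equal(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmpeq_ps(a, b)) == 0xF;
}
```

Calling lanes_equal(constants_from_static_array(), constants_from_setr()) is a quick sanity check that both routes produce the same register contents; which one yields the tighter code depends on the compiler and flags.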
First off, what optimization level are you compiling at? It's not uncommon to see that sort of codegen at -O0 or -O1, but I would be quite surprised to see it with -O2 or higher in most compilers.
Second, there are no immediate loads in SSE. You can load an immediate into a GPR and then move that value to an SSE register, but you cannot conjure other values without an actual load (ignoring certain special values like 0 or (int)-1, which can be produced via logical operations).
Finally, if the bad code is being generated with optimizations turned on and in a performance-critical location, please file a bug against your compiler.
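To make the second point concrete, here is a rough sketch of the memory-free options using SSE2 intrinsics. The function names are made up for illustration. The instruction named in each comment is what these intrinsics usually map to; an optimizing compiler is free to constant-fold any of them back into a load from memory, so check the disassembly. The shift trick for 1.0f is a commonly used idiom, not something taken from the answer above.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics; also pulls in the SSE float ones */

/* All bits zero: usually xorps/pxor of a register with itself. */
static __m128 make_zero(void)
{
    return _mm_setzero_ps();
}

/* All bits one, i.e. (int)-1 in every lane: comparing a register with
   itself for equality (pcmpeqd) sets every bit. */
static __m128i make_all_ones(void)
{
    __m128i z = _mm_setzero_si128();
    return _mm_cmpeq_epi32(z, z);
}

/* 1.0f in lane 0 only, via a GPR: mov r32, imm32 followed by movd.
   0x3f800000 is the IEEE-754 bit pattern of 1.0f; upper lanes are zero. */
static __m128 make_one_in_lane0(void)
{
    return _mm_castsi128_ps(_mm_cvtsi32_si128(0x3f800000));
}

/* 1.0f in all four lanes without any memory operand:
   0xFFFFFFFF << 25 = 0xFE000000, then a logical >> 2 gives 0x3F800000. */
static __m128 make_ones_by_shifts(void)
{
    __m128i bits = _mm_srli_epi32(_mm_slli_epi32(make_all_ones(), 25), 2);
    return _mm_castsi128_ps(bits);
}
```

Arbitrary constants such as { 1.0f, 2.0f, 3.0f, 4.0f } have no comparable trick that beats one aligned load, so a single movaps from a constant in memory is normally the best you can expect.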