Is it safe/possible/advisable to cast floats directly to __m128 if they are 16-byte aligned?
I noticed that using _mm_load_ps and _mm_store_ps to "wrap" a raw array adds a significant overhead.
What are potential pitfalls I should be aware of?
EDIT:
There is actually no overhead in using the load and store instructions. I got some numbers mixed up, and that is why I seemed to get better performance. Even though I was able to do some HORRENDOUS mangling with raw memory addresses in a __m128 instance, when I ran the test it took TWICE AS LONG to complete without the _mm_load_ps instruction, probably falling back to some fail-safe code path.
What makes you think that _mm_load_ps and _mm_store_ps "add a significant overhead"? This is the normal way to load/store float data to/from SSE registers, assuming the source/destination is memory (and any other method eventually boils down to this anyway).
There are several ways to put float values into SSE registers; the following intrinsics can be used:
__m128 sseval;
float a, b, c, d;
sseval = _mm_set_ps(a, b, c, d); // lanes, lowest first: [ d, c, b, a ]
sseval = _mm_setr_ps(a, b, c, d); // lanes, lowest first: [ a, b, c, d ]
sseval = _mm_load_ps(&a); // ill-specified here - "a" is not a float[4] ...
// same as _mm_setr_ps(a[0], a[1], a[2], a[3])
// if you have an actual, 16-byte aligned array
sseval = _mm_set1_ps(a); // make vector from [ a, a, a, a ]
sseval = _mm_load1_ps(&a); // load from &a, replicate - same as previous
sseval = _mm_set_ss(a); // make vector from [ a, 0, 0, 0 ]
sseval = _mm_load_ss(&a); // load from &a, zero others - same as prev
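If the lane order of these intrinsics is unclear, it can be checked directly. The following is only a sketch: lanes_of and check_lane_order are hypothetical helper names, not part of any library, and the constants are arbitrary. It stores each vector back to memory and looks at which value ended up in the lowest lane.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Hypothetical helper: copy the four lanes of v to out, lowest lane first.
   _mm_storeu_ps has no alignment requirement on out. */
void lanes_of(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);
}

/* Hypothetical check of the lane order produced by set, setr and load. */
void check_lane_order(void)
{
    _Alignas(16) float arr[4] = { 1.0f, 2.0f, 3.0f, 4.0f }; /* C11 alignment */
    float out[4];

    /* _mm_set_ps: the LAST argument goes into the lowest lane. */
    lanes_of(_mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f), out);
    assert(out[0] == 4.0f && out[3] == 1.0f);

    /* _mm_setr_ps: the FIRST argument goes into the lowest lane. */
    lanes_of(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f), out);
    assert(out[0] == 1.0f && out[3] == 4.0f);

    /* _mm_load_ps: arr[0] goes into the lowest lane, like _mm_setr_ps. */
    lanes_of(_mm_load_ps(arr), out);
    assert(out[0] == 1.0f && out[3] == 4.0f);
}
```

Calling check_lane_order() from a test program should pass silently on an x86 target with SSE enabled.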
The compiler will often create the same instructions no matter whether you state _mm_set_ss(val) or _mm_load_ss(&val) - try it and disassemble your code.
It can, in some cases, be advantageous to write _mm_set_ss(*valptr) instead of _mm_load_ss(valptr); it depends on (the structure of) your code.
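A small pair of functions to try this with, as a sketch only: scalar_from_value, scalar_from_pointer and check_same_result are made-up names for illustration. Compiling this with optimizations on and reading the generated assembly (for example the output of gcc -O2 -S) lets you compare what the compiler emits for each form; the result is not guaranteed to be identical on every compiler.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Hypothetical wrapper: build [ val, 0, 0, 0 ] from a value. */
__m128 scalar_from_value(float val)
{
    return _mm_set_ss(val);
}

/* Hypothetical wrapper: build [ *valptr, 0, 0, 0 ] from a pointer. */
__m128 scalar_from_pointer(const float *valptr)
{
    return _mm_load_ss(valptr);
}

/* Hypothetical check: both forms yield the same four lanes.
   Assumes val is not NaN, since NaN never compares equal. */
void check_same_result(float val)
{
    __m128 eq = _mm_cmpeq_ps(scalar_from_value(val),
                             scalar_from_pointer(&val));
    assert(_mm_movemask_ps(eq) == 0xF); /* one mask bit per equal lane */
}
```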
Going by http://msdn.microsoft.com/en-us/library/ayeb3ayc.aspx, it's possible but not safe or recommended.
You should not access the __m128 fields directly.
And here's the reason why:
http://social.msdn.microsoft.com/Forums/en-US/vclanguage/thread/766c8ddc-2e83-46f0-b5a1-31acbb6ac2c5/
- Casting float* to __m128 will not work. The C++ compiler converts an assignment to a __m128 variable into an SSE instruction that loads 4 float numbers into an SSE register. Even if the cast compiles, it does not produce working code, because the SSE load instruction is not generated.
A __m128 variable is not actually a variable or an array. It is a placeholder for an SSE register, which the C++ compiler replaces with SSE assembly instructions. To understand this better, read the Intel Assembly Programming Reference.
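For comparison, here is a sketch of the load/store route applied to a raw float array, with no pointer cast and no access to __m128 fields. scale_in_place is a hypothetical helper, and it assumes the array is 16-byte aligned and its length is a multiple of 4; the assert only checks those assumptions in debug builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical helper: multiply n floats by factor, four at a time.
   Assumes data is 16-byte aligned and n is a multiple of 4. */
void scale_in_place(float *data, size_t n, float factor)
{
    const __m128 f = _mm_set1_ps(factor); /* [ factor, factor, factor, factor ] */
    size_t i;

    assert(((uintptr_t)data & 15u) == 0); /* _mm_load_ps needs alignment */
    assert(n % 4 == 0);

    for (i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(data + i);  /* memory -> SSE register */
        v = _mm_mul_ps(v, f);              /* four multiplies at once */
        _mm_store_ps(data + i, v);         /* SSE register -> memory */
    }
}
```

If the alignment cannot be guaranteed, _mm_loadu_ps and _mm_storeu_ps accept unaligned addresses and can be substituted for the aligned variants.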