Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to cast floats directly to __m128 if they are 16 byte aligned?

Is it safe/possible/advisable to cast floats directly to __m128 if they are 16 byte aligned?

I noticed using _mm_load_ps and _mm_store_ps to "wrap" a raw array adds a significant overhead.

What are potential pitfalls I should be aware of?

EDIT :

There is actually no overhead in using the load and store instructions, I got some numbers mixed and that is why I got better performance. Even thou I was able to do some HORRENDOUS mangling with raw memory addresses in a __m128 instance, when I ran the test it took TWICE AS LONG to complete without the _mm_load_ps instruction, probably falling back to some fail safe code path.

like image 755
dtech Avatar asked Aug 01 '12 12:08

dtech


3 Answers

What makes you think that _mm_load_ps and _mm_store_ps "add a significant overhead" ? This is the normal way to load/store float data to/from SSE registers assuming source/destination is memory (and any other method eventually boils down to this anyway).

like image 87
Paul R Avatar answered Nov 15 '22 17:11

Paul R


There are several ways to put float values into SSE registers; the following intrinsics can be used:

__m128 sseval;
float a, b, c, d;

sseval = _mm_set_ps(a, b, c, d);  // make vector from [ a, b, c, d ]
sseval = _mm_setr_ps(a, b, c, d); // make vector from [ d, c, b, a ]
sseval = _mm_load_ps(&a);         // ill-specified here - "a" not float[] ...
                                  // same as _mm_set_ps(a[0], a[1], a[2], a[3])
                                  // if you have an actual array

sseval = _mm_set1_ps(a);          // make vector from [ a, a, a, a ]
sseval = _mm_load1_ps(&a);        // load from &a, replicate - same as previous

sseval = _mm_set_ss(a);           // make vector from [ a, 0, 0, 0 ]
sseval = _mm_load_ss(&a);         // load from &a, zero others - same as prev

The compiler will often create the same instructions no matter whether you state _mm_set_ss(val) or _mm_load_ss(&val) - try it and disassemble your code.

It can, in some cases, be advantageous to write _mm_set_ss(*valptr) instead of _mm_load_ss(valptr) ... depends on (the structure of) your code.

like image 21
FrankH. Avatar answered Nov 15 '22 16:11

FrankH.


Going by http://msdn.microsoft.com/en-us/library/ayeb3ayc.aspx, it's possible but not safe or recommended.

You should not access the __m128 fields directly.


And here's the reason why:

http://social.msdn.microsoft.com/Forums/en-US/vclanguage/thread/766c8ddc-2e83-46f0-b5a1-31acbb6ac2c5/

  1. Casting float* to __m128 will not work. C++ compiler converts assignment to __m128 type to SSE instruction loading 4 float numbers to SSE register. Assuming that this casting is compiled, it doesn't create working code, because SEE loading instruction is not generated.

__m128 variable is not actually variable or array. This is placeholder for SSE register, replaced by C++ compiler to SSE Assembly instruction. To understand this better, read Intel Assembly Programming Reference.

like image 40
JAB Avatar answered Nov 15 '22 17:11

JAB