I am looking at the generated assembly for my code (using Visual Studio 2017) and noticed that _mm_load_ps is often (always?) compiled to movups.
The data I'm using _mm_load_ps on is defined like this:
struct alignas(16) Vector {
float v[4];
}
// often embedded in other structs like this
struct AABB {
Vector min;
Vector max;
bool intersection(/* parameters */) const;
}
Now when I'm using this construct, the following will happen:
// this code
__mm128 bb_min = _mm_load_ps(min.v);
// generates this
movups xmm4, XMMWORD PTR [r8]
I'm expecting movaps because of alignas(16). Do I need something else to convince the compiler to use movaps in this case?
EDIT: My question is different from this question because I'm not getting any crashes. The struct is specifically aligned and I'm also using aligned allocation. Rather, I'm curious why the compiler is switching _mm_load_ps (the intrinsic for aligned memory) to movups. If I know struct was allocated at an aligned address and I'm calling it via this* it would be safe to use movaps, right?
On recent versions of Visual Studio and the Intel Compiler (recent as post-2013?), the compiler rarely ever generates aligned SIMD load/stores anymore.
When compiling for AVX or higher:
memset()
.The compiler is allowed to do this because it's not a loss of functionality when the code is written correctly. All processors starting from Nehalem have no penalty for unaligned load/stores when the address is aligned.
Microsoft's stance on this issue is that it "helps the programmer by not crashing". Unfortunately, I can't find the original source for this statement from Microsoft anymore. In my opinion, this achieves the exact opposite of that because it hides misalignment penalties. From the correctness standpoint, it also hides incorrect code.
Whatever the case is, unconditionally using unaligned load/stores does simplify the compiler a bit.
New Relevations:
Both cases result in inevitable performance degradation on older processors. But it seems that this is intentional as both Intel and Microsoft no longer care about old processors.
The only load/store intrinsics that are immune to this are the non-temporal load/stores. There is no unaligned equivalent of them, so the compiler has no choice.
So if you want to just test for correctness of your code, you can substitute in the load/store intrinsics for non-temporal ones. But be careful not to let something like this slip into production code since NT load/stores (NT-stores in particular) are a double-edged sword that can hurt you if you don't know what you're doing.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With