Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Visual Studio 2017: _mm_load_ps often compiled to movups

I am looking at the generated assembly for my code (using Visual Studio 2017) and noticed that _mm_load_ps is often (always?) compiled to movups.

The data I'm using _mm_load_ps on is defined like this:

struct alignas(16) Vector {
    float v[4];
}

// often embedded in other structs like this
struct AABB {
    Vector min;
    Vector max;
    bool intersection(/* parameters */) const;
}

Now when I'm using this construct, the following will happen:

// this code
__mm128 bb_min = _mm_load_ps(min.v);

// generates this
movups  xmm4, XMMWORD PTR [r8]

I'm expecting movaps because of alignas(16). Do I need something else to convince the compiler to use movaps in this case?

EDIT: My question is different from this question because I'm not getting any crashes. The struct is specifically aligned and I'm also using aligned allocation. Rather, I'm curious why the compiler is switching _mm_load_ps (the intrinsic for aligned memory) to movups. If I know struct was allocated at an aligned address and I'm calling it via this* it would be safe to use movaps, right?

like image 729
B_old Avatar asked Mar 09 '17 13:03

B_old


1 Answers

On recent versions of Visual Studio and the Intel Compiler (recent as post-2013?), the compiler rarely ever generates aligned SIMD load/stores anymore.

When compiling for AVX or higher:

  • The Microsoft compiler (>VS2013?) doesn't generate aligned loads. But it still generates aligned stores.
  • The Intel compiler (> Parallel Studio 2012?) doesn't do it at all anymore. But you'll still see them in ICC-compiled binaries inside their hand-optimized libraries like memset().
  • As of GCC 6.1, it still generates aligned load/stores when you use the aligned intrinsics.

The compiler is allowed to do this because it's not a loss of functionality when the code is written correctly. All processors starting from Nehalem have no penalty for unaligned load/stores when the address is aligned.

Microsoft's stance on this issue is that it "helps the programmer by not crashing". Unfortunately, I can't find the original source for this statement from Microsoft anymore. In my opinion, this achieves the exact opposite of that because it hides misalignment penalties. From the correctness standpoint, it also hides incorrect code.

Whatever the case is, unconditionally using unaligned load/stores does simplify the compiler a bit.

New Relevations:

  • Starting Parallel Studio 2018, the Intel Compiler no longer generates aligned moves at all - even for pre-Nehalem targets.
  • Starting from Visual Studio 2017, the Microsoft Compiler also no longer generates aligned moves at all - even when targeting pre-AVX hardware.

Both cases result in inevitable performance degradation on older processors. But it seems that this is intentional as both Intel and Microsoft no longer care about old processors.


The only load/store intrinsics that are immune to this are the non-temporal load/stores. There is no unaligned equivalent of them, so the compiler has no choice.

So if you want to just test for correctness of your code, you can substitute in the load/store intrinsics for non-temporal ones. But be careful not to let something like this slip into production code since NT load/stores (NT-stores in particular) are a double-edged sword that can hurt you if you don't know what you're doing.

like image 173
Mysticial Avatar answered Nov 14 '22 22:11

Mysticial