There are generally two types of SIMD instructions:
A. Those that require aligned memory addresses and raise a general-protection (#GP) exception if the address is not aligned on the operand-size boundary:
movaps xmm0, xmmword ptr [rax]
vmovaps ymm0, ymmword ptr [rax]
vmovaps zmm0, zmmword ptr [rax]
B. And those that accept unaligned memory addresses and do not raise such an exception:
movups xmm0, xmmword ptr [rax]
vmovups ymm0, ymmword ptr [rax]
vmovups zmm0, zmmword ptr [rax]
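To see the difference in practice, here is a minimal sketch (my own illustration, not from the original post) that loads through a deliberately misaligned pointer. Note that an optimizing compiler may fold or relax the aligned load, so build with -O0 to observe the fault reliably:

#include <stdio.h>
#include <stdalign.h>
#include <x86intrin.h>

int main(void) {
    alignas(16) float buf[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float *p = (float *)((char *)buf + 4); /* aligned base + 4 bytes => misaligned */
    float out[4];

    __m128 v = _mm_loadu_ps(p);            /* movups: fine at any address */
    _mm_storeu_ps(out, v);
    printf("movups loaded %g %g %g %g\n", out[0], out[1], out[2], out[3]);

    __m128 w = _mm_load_ps(p);             /* movaps: #GP on this address, */
    _mm_storeu_ps(out, w);                 /* typically delivered as SIGSEGV */
    printf("this line is normally never reached\n");
    return 0;
}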
But I'm just curious, why would I want to shoot myself in the foot and use aligned memory instructions from the first group at all?
Alignment helps the CPU fetch data from memory efficiently: fewer cache misses/flushes, fewer bus transactions, etc. Some memory types (e.g. RDRAM, DRAM) need to be accessed in a structured manner (aligned "words" and "burst transactions", i.e. many words at a time) to yield efficient results.
The alignment of an access refers to the address being a multiple of the transfer size. For example, an aligned 32-bit access will have its two least-significant address bits equal to zero, so the low nibble of the address is 0x0, 0x4, 0x8, or 0xC, assuming the memory is byte-addressed. An unaligned address is then any address that isn't a multiple of the transfer size.
Unaligned memory accesses occur when you try to read N bytes of data starting from an address that is not evenly divisible by N (i.e. addr % N != 0). For example, reading 4 bytes of data from address 0x10004 is fine, but reading 4 bytes of data from address 0x10005 would be an unaligned memory access.
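Since the transfer sizes are powers of two, the addr % N test is usually written with a bit mask. A small helper along these lines (the name is my own):

#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* True if p is aligned on an n-byte boundary; n must be a power of two. */
static inline bool is_aligned(const void *p, size_t n) {
    return ((uintptr_t)p & (n - 1)) == 0;
}

/* is_aligned(p, 16) must hold before using movaps/_mm_load_ps on p;
   32 bytes for ymm registers, 64 bytes for zmm registers. */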
For unaligned data, movups/vmovups can be used. The same penalties that apply to aligned accesses apply here too, and accesses that cross a cache-line or virtual-page boundary always incur a penalty on all processors. On older microarchitectures, movups/vmovups additionally consume more resources (up to twice as much) in the frontend and the backend of the pipeline; in other words, movups/vmovups can be up to twice as slow as movaps/vmovaps in terms of latency and/or throughput.

Therefore, if you don't care about the older microarchitectures, the two forms are technically equivalent. Although if you know or expect your data to be aligned, you should use the aligned instructions to ensure that the data is indeed aligned, without having to add explicit checks in the code.
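If you do want to rely on the aligned instructions, you have to arrange the alignment yourself. A minimal sketch of the two standard C11 ways to do that (the function name is my own):

#include <stdalign.h>
#include <stdlib.h>

/* Compile-time alignment for static or automatic storage;
   64 bytes covers xmm, ymm and zmm accesses. */
alignas(64) static float static_buf[16];

/* Heap allocation: C11 aligned_alloc requires the size to be a
   multiple of the alignment, so round it up. GCC/Clang and MSVC
   also provide _mm_malloc/_mm_free for the same purpose. */
static float *alloc_aligned_floats(size_t n) {
    size_t bytes = (n * sizeof(float) + 63) / 64 * 64;
    return aligned_alloc(64, bytes);
}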
I think there is a subtle difference between using _mm_loadu_ps and _mm_load_ps even on "Intel Nehalem and later (including Silvermont and later) and AMD Bulldozer and later" which can have an impact on performance. Operations which fold a load and another operation, such as a multiplication, into one instruction can only be done with the load intrinsics, not the loadu intrinsics, unless you compile with AVX enabled to allow unaligned memory operands.
Consider the following code
#include <x86intrin.h>

__m128 foo(float *x, float *y) {
    __m128 vx = _mm_loadu_ps(x);   /* unaligned load */
    __m128 vy = _mm_loadu_ps(y);
    return _mm_mul_ps(vx, vy);     /* vx * vy also works as a GCC/Clang vector extension */
}
This gets converted to
movups xmm0, XMMWORD PTR [rdi]
movups xmm1, XMMWORD PTR [rsi]
mulps xmm0, xmm1
However, if the aligned load intrinsic (_mm_load_ps) is used instead, it compiles to
movaps xmm0, XMMWORD PTR [rdi]
mulps xmm0, XMMWORD PTR [rsi]
which saves one instruction. But if the compiler can use VEX-encoded loads (e.g., when compiling with -mavx), unaligned loads need only two instructions as well:
vmovups xmm0, XMMWORD PTR [rsi]
vmulps xmm0, xmm0, XMMWORD PTR [rdi]
Therefore, although for aligned accesses there is no performance difference between the movaps and movups instructions on Intel Nehalem and later, Silvermont and later, or AMD Bulldozer and later, there can be a performance difference between the _mm_loadu_ps and _mm_load_ps intrinsics when compiling without AVX enabled. In that case the compiler's tradeoff is not movaps vs. movups, but movups vs. folding the load into an ALU instruction. (Folding happens when the vector is used as an input to only one operation; otherwise the compiler emits a separate mov* load to get the result into a register for reuse, as the sketch below illustrates.)
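To make that parenthetical concrete, here is a variant of the earlier example (my own sketch): because vx is used twice, the compiler must materialize it in a register with a separate mov(a|u)ps either way, so the choice of intrinsic no longer changes the instruction count on these CPUs.

#include <x86intrin.h>

__m128 bar(float *x, float *y) {
    __m128 vx = _mm_loadu_ps(x);   /* used twice: cannot be folded away */
    __m128 vy = _mm_loadu_ps(y);
    return _mm_add_ps(_mm_mul_ps(vx, vy), vx);
}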