
Why does __m128 cause alignment issues in a union with float x/y/z?

I've never actually run into this problem before, at least not that I'm aware of, but I'm working on some SIMD vector optimizations in my code and I'm having some alignment issues.

Here's some minimal code that I've been able to reproduce the problem with, on MSVC (Visual Studio 2022):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <xmmintrin.h>

_declspec(align(16)) typedef union
{
    struct { float x, y, z; };

#if 0
    // This works:
    float v[4];
#else
    // This does not:
    __m128 v;
#endif
} vec;

typedef struct
{
    vec pos;
    vec vel;
    float radius;
} particle;

int main(int argc, char **argv)
{
    particle *particles=malloc(sizeof(particle)*10);

    if(particles==NULL)
        return -1;

    // intentionally misalign the pointer
    particles=(particle *)((uint8_t *)particles+3);

    printf("misalignment: %llu\n", (unsigned long long)((uintptr_t)particles%16));

    particles[0].pos=(vec){ 1.0f, 2.0f, 3.0f };
    particles[0].vel=(vec){ 4.0f, 5.0f, 6.0f };

    printf("pos: %f %f %f\nvel: %f %f %f\n",
           particles[0].pos.x, particles[0].pos.y, particles[0].pos.z,
           particles[0].vel.x, particles[0].vel.y, particles[0].vel.z);

    return 0;
}

I don't understand why a union with float x/y/z and float[4] works with misaligned memory addresses, but a union with the float x/y/z and an __m128 generates an access violation. I get that the __m128 type has some extra alignment specs on it, but the overall union size doesn't change and it's also 16 byte aligned anyway, so why does it matter?

I do understand the importance of memory alignment, but the extra weird part is that I added in an aligned_malloc to my code that's allocating the offending misaligned memory (I use a slab/zone memory allocator in my code) and it still continued to crash out with an access violation, which further adds to my hair loss.

Seishuku asked Oct 15 '25 19:10



1 Answer

alignof(your_union) is 16 when it includes a __m128 member, so compilers will use movaps or movdqa because you've promised them that the data is aligned. Otherwise alignof(your_union) is only 4 (inherited from float), so they'll use movups or movdqu, which have no alignment requirement.
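To check the alignment difference yourself, something like the following sketch can help. The type names vec_f4 and vec_m128 and the two helper functions are hypothetical, they just mirror the two variants of the union from the question, and it assumes a C11 compiler targeting x86 (on MSVC that means /std:c11 or later).

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical names: the two variants of the union from the question,
   without any extra alignment attribute. */
typedef union {
    struct { float x, y, z; };
    float v[4];
} vec_f4;

typedef union {
    struct { float x, y, z; };
    __m128 v;
} vec_m128;

/* Same size either way; only the required alignment differs. */
static_assert(sizeof(vec_f4) == sizeof(vec_m128), "both variants are 16 bytes");

/* Alignment the compiler assumes for each variant. */
static inline size_t vec_f4_alignment(void)   { return _Alignof(vec_f4); }
static inline size_t vec_m128_alignment(void) { return _Alignof(vec_m128); }
```

Printing both values next to _Alignof(float) shows which variant lets the compiler assume 16-byte alignment, and therefore pick alignment-required loads and stores.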

Even the float v[4] version is still alignment undefined behaviour, as gcc -fsanitize=undefined will tell you, since you're using an address that's not even aligned by 4. It just happens to compile to asm that doesn't fault.

https://godbolt.org/z/6GxebxT7r shows MSVC is using movdqa stores for your code, like movdqa [rbx+19], xmm2 where RBX holds a malloc return value. This is guaranteed to fault because malloc return values are aligned by alignof(max_align_t), which is definitely an even number and usually 16 in x86-64, so RBX+19 is always odd and can never be 16-byte aligned.

Often MSVC will only use unaligned movdqu / movups loads/stores even when you use _mm_store_ps. (But alignment-required intrinsics will let it fold the load into a memory source operand for non-AVX instructions like addps xmm0, [rcx]).

But apparently MSVC treats aggregates differently from deref of a __m128*.

So your type has alignof(T) == 16, and thus your code has alignment UB, so it can and does compile to asm that faults.
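If you genuinely have to read or write vector data at an address that might not be aligned, one way that avoids the UB is to go through memcpy (or _mm_loadu_ps / _mm_storeu_ps on a properly typed float pointer) instead of dereferencing a pointer to the over-aligned type. This is only a sketch; is_aligned16, load4_unaligned and store4_unaligned are hypothetical helper names, not anything from the question's code.

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>

/* Hypothetical helper: nonzero if p is a multiple of 16. */
static inline int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: read 16 bytes from an address with no alignment
   guarantee. memcpy has no alignment requirement, and compilers typically
   turn this into a single unaligned load. */
static inline __m128 load4_unaligned(const void *p)
{
    __m128 v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: write 16 bytes to an address with no alignment
   guarantee. */
static inline void store4_unaligned(void *p, __m128 v)
{
    memcpy(p, &v, sizeof v);
}
```

With a byte pointer that was bumped by 3 like in the question, store4_unaligned(p, v) followed by load4_unaligned(p) round-trips the data without ever promising the compiler an alignment the address doesn't have.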


BTW, I wouldn't recommend using this union; especially not for function args / return values since being part of an aggregate can make the calling conventions treat it less efficiently. (On MSVC you have to use vectorcall to get it passed in a register if it doesn't inline, but x86-64 System V passes vector args in vector regs normally, if they aren't part of a union.)

Use __m128 vectors and write helper functions to get your data in/out as scalar.
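A minimal sketch of what such helpers could look like. The names vec3_set, vec3_x, vec3_y and vec3_z are made up for illustration, and putting x in the lowest element with the top element zeroed is an assumption, not something the question's code requires.

```c
#include <xmmintrin.h>

/* Hypothetical helper: build a vector with x in element 0, y in 1, z in 2,
   and 0 in the unused top element. _mm_set_ps takes the highest element
   first. */
static inline __m128 vec3_set(float x, float y, float z)
{
    return _mm_set_ps(0.0f, z, y, x);
}

/* Hypothetical helper: element 0 can be read directly. */
static inline float vec3_x(__m128 v)
{
    return _mm_cvtss_f32(v);
}

/* Hypothetical helper: broadcast element 1, then read the low element. */
static inline float vec3_y(__m128 v)
{
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)));
}

/* Hypothetical helper: broadcast element 2, then read the low element. */
static inline float vec3_z(__m128 v)
{
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)));
}
```

The math then stays on plain __m128 values (which can be passed in registers), and scalar access only happens at the edges through these accessors.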

Ideally don't use 1 SIMD vector to hold 1 geometry vector; that's kind of an anti-pattern since it leads to a lot of shuffling. Better to have arrays of x, arrays of y, and arrays of z, so you can load 3 vectors of data and process 4 geometry vectors in parallel with no shuffling. (Struct-of-Arrays rather than Array-of-Structs). See https://stackoverflow.com/tags/sse/info especially https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/
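As a rough illustration of the Struct-of-Arrays idea, here is a sketch that computes the squared length of four geometry vectors per loop iteration. The function name length_squared_soa and the requirement that n is a multiple of 4 are assumptions made to keep the example short; unaligned loads are used so it makes no promise about where the arrays live.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical SoA kernel: out[i] = x[i]^2 + y[i]^2 + z[i]^2.
   Assumes n is a multiple of 4 to avoid a scalar cleanup loop. */
static inline void length_squared_soa(const float *x, const float *y,
                                      const float *z, float *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* Three loads bring in the components of 4 geometry vectors. */
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        __m128 vz = _mm_loadu_ps(z + i);

        /* Purely vertical math: no shuffles needed. */
        __m128 sum = _mm_add_ps(_mm_mul_ps(vx, vx), _mm_mul_ps(vy, vy));
        sum = _mm_add_ps(sum, _mm_mul_ps(vz, vz));

        _mm_storeu_ps(out + i, sum);
    }
}
```

Every element of each SIMD register does useful work here, whereas the one-vector-per-xyz layout wastes the fourth element and needs a horizontal sum for the same result.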

Or if you really want to do it this way, you could still improve this. Your struct particle is 36 bytes as you've defined it in the float v[4] version, with two wasted 32-bit float slots. It could have been 32 bytes: xyz, radius, xyz, zeroed padding, so you could have alignof(particle) == 16 without increasing the size to 48 bytes, to be able to load it efficiently (never spanning cache-line boundaries). The radius would get loaded as high garbage along with the position by _mm_load_ps(&particle->pos_x), which gets the x,y,z positions and whatever comes next. You might sometimes have to use an extra instruction to zero out the high element, but probably most of the time you could be shuffling in ways that don't care about it.

Actually your struct particle is 48 bytes when you have a __m128 member, since it inherits the alignof(T) from its vec pos and vec vel members, and sizeof(T) has to be a multiple of alignof(T) (so arrays work).
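A sketch of the 32-byte layout described above, assuming C11 (_Alignas, static_assert) and SSE2 for the mask constant. The type name particle32, the field names, and the helpers particle_load_pos and zero_high_element are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>

/* Hypothetical 32-byte layout: xyz, radius, xyz, padding. */
typedef struct {
    _Alignas(16) float pos_x;   /* forces alignof(particle32) == 16 */
    float pos_y, pos_z;
    float radius;               /* fills the slot after the position */
    float vel_x, vel_y, vel_z;
    float pad;                  /* keep this zeroed */
} particle32;

static_assert(sizeof(particle32) == 32, "two 16-byte halves");
static_assert(_Alignof(particle32) == 16, "aligned loads are safe");

/* Hypothetical helper: loads x, y, z plus radius in the high element. */
static inline __m128 particle_load_pos(const particle32 *p)
{
    return _mm_load_ps(&p->pos_x);
}

/* Hypothetical helper: clear the high element when the garbage matters. */
static inline __m128 zero_high_element(__m128 v)
{
    const __m128 mask = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
    return _mm_and_ps(v, mask);
}
```

Note the aligned load is only valid if the particle itself really is stored at a 16-byte-aligned address, so this doesn't help with storage that was deliberately offset by 3 bytes as in the question.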

Peter Cordes answered Oct 18 '25 11:10



