How to make MSVC generate assembly which caches memory in a register?

I have a type mat4 that represents a float[4][4]. Internally, it is stored in a single 512-bit register.

union alignas(16 * sizeof(float)) mat4 {
private:
    __m512 m512;
    __m512d m512d;
    ALWAYS_INLINE mat4(__m512 m512) : m512{m512} {}
    ALWAYS_INLINE mat4(__m512d m512d) : m512d{m512d} {}
    ALWAYS_INLINE operator __m512&() { return m512; }
    ALWAYS_INLINE operator __m512d&() { return m512d; }
    ALWAYS_INLINE operator const __m512&() const { return m512; }
    ALWAYS_INLINE operator const __m512d&() const { return m512d; }
    ALWAYS_INLINE mat4& operator=(__m512 _m512) {
        m512 = _m512;
        return *this;
    }
    ALWAYS_INLINE mat4& operator=(__m512d _m512d) {
        m512d = _m512d;
        return *this;
    }

public:
    friend void __vectorcall transform_children(mat4 parent, std::span<mat4> children);
};
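(ALWAYS_INLINE and ASSUME are compiler-hint macros; their exact definitions aren't shown in this post, but something along these lines is assumed:)

// Assumed definitions of the helper macros (not part of the original question):
#if defined(_MSC_VER) && !defined(__clang__)
#define ALWAYS_INLINE __forceinline
#define ASSUME(cond) __assume(cond)
#else
#define ALWAYS_INLINE inline __attribute__((always_inline))
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#endif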

I also have a function transform_children(mat4 parent, std::span<mat4> children). It treats all mat4s as transformation matrices and transforms all the children (in place) by multiplying them with the parent. I wrote[1] an optimised implementation using AVX512F intrinsics.

void __vectorcall transform_children(mat4 parent, std::span<mat4> children) {
    mat4* const __restrict bs = children.data();
    const size_t n = children.size();

    ASSUME(n != 0);

    const mat4 zmm1 = _mm512_permute_ps(parent, 0);
    const mat4 zmm2 = _mm512_permute_ps(parent, 85);
    const mat4 zmm3 = _mm512_permute_ps(parent, 170);
    const mat4 zmm0 = _mm512_permute_ps(parent, 255);

    for (int i = 0; i < n; ++i) {
        mat4& __restrict zmm4 = bs[i];
        mat4 zmm5 = _mm512_shuffle_f64x2(zmm4, zmm4, 85);
        zmm5 = _mm512_mul_ps(zmm5, zmm2);
        mat4 zmm6 = _mm512_shuffle_f64x2(zmm4, zmm4, 0);
        zmm6 = _mm512_fmadd_ps(zmm1, zmm6, zmm5);
        zmm5 = _mm512_shuffle_f64x2(zmm4, zmm4, 170);
        zmm4 = _mm512_shuffle_f64x2(zmm4, zmm4, 255);
        zmm4 = _mm512_fmadd_ps(zmm0, zmm4, zmm6);
        zmm4 = _mm512_fmadd_ps(zmm3, zmm5, zmm4);
    }
}

Both GCC and Clang compile this nearly literally into optimised assembly. Unfortunately, MSVC does something weird. For some reason, instead of loading the value of bs[i] into a register and then storing it back to the array at the end of the iteration, it accesses the memory four times:

void transform_children(mat4,std::span<mat4,4294967295>) PROC ; transform_children, COMDAT
        mov     ecx, DWORD PTR _children$[esp]
        vpermilps zmm4, zmm0, 0
        vpermilps zmm5, zmm0, 85                        
        vpermilps zmm6, zmm0, 170                 
        vpermilps zmm7, zmm0, 255                 
        test    ecx, ecx
        je      SHORT $LN36@transform_
        mov     eax, DWORD PTR _children$[esp-4]
        npad    8
$LL4@transform_:
        lea     eax, DWORD PTR [eax+64]
        vmovupd zmm3, ZMMWORD PTR [eax-64]              ; HERE
        vshuff64x2 zmm0, zmm3, zmm3, 85       
        vmulps  zmm0, zmm0, zmm5
        vshuff64x2 zmm1, zmm3, zmm3, 0
        vmovups zmm2, zmm4
        vfmadd213ps zmm2, zmm1, zmm0 
        vshuff64x2 zmm0, zmm3, zmm3, 255 
        vmovupd ZMMWORD PTR [eax-64], zmm0              ; HERE
        vfmadd231ps zmm2, zmm7, ZMMWORD PTR [eax-64]    ; HERE
        vshuff64x2 zmm1, zmm3, zmm3, 170               
        vmovups zmm0, zmm6
        vfmadd213ps zmm0, zmm1, zmm2
        vmovups ZMMWORD PTR [eax-64], zmm0              ; HERE
        sub     ecx, 1
        jne     SHORT $LL4@transform_
$LN36@transform_:
        vzeroupper
        ret     8
void transform_children(mat4,std::span<mat4,4294967295>) ENDP ; transform_children

What could I do to make MSVC access memory only twice, like GCC and Clang[2] do?


[1] To be precise, GCC and Clang wrote this implementation (sort of). First, I wrote the typical implementation using two nested loops. Then, I ran it through GCC using -mavx512f. GCC was smart enough to generate optimised vectorised code. Then, I converted this vectorised code from assembly back to C++ using intrinsics. Then, I compiled the new intrinsic code with Clang, and it generated an even faster vectorised assembly. Finally, I converted Clang's assembly back to C++ intrinsics again.
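For reference, the scalar starting point was presumably something like the classic formulation below (a reconstruction on the editor's part, not the exact original):

#include <cstddef>
#include <cstring>

// Reconstruction (assumed) of the scalar starting point that GCC vectorised:
// each child is replaced, in place, by parent * child using plain float[4][4] maths.
void transform_children_scalar(const float (&parent)[4][4],
                               float (*children)[4][4], std::size_t n) {
    for (std::size_t c = 0; c < n; ++c) {
        float result[4][4] = {};
        for (int i = 0; i < 4; ++i)
            for (int k = 0; k < 4; ++k)
                for (int j = 0; j < 4; ++j)
                    result[i][j] += parent[i][k] * children[c][k][j];
        std::memcpy(children[c], result, sizeof result);
    }
}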

[2] Clang accesses memory 4 times, but it unrolls the loop, so that's still two accesses per iteration.

asked Oct 21 '25 by janekb04

2 Answers

TL:DR: it turns out that MSVC does a bad job when it has to convert between __m512d and __m512 through the overloaded conversions of your mat4 class. So just do everything with __m512 intrinsics, including the shuffling of 128-bit lanes.


MSVC making worse code is unfortunate but not shocking; MSVC's optimizer is well known to be not as good as GCC's or Clang's in general. MSVC doesn't do strict-aliasing optimizations, although __m512 can alias anything anyway, so IDK if that's relevant here.

Seems like you should just use a __m512 (or maybe mat4) temporary variable, instead of telling the compiler to modify bs[i] repeatedly and hoping it actually doesn't.

Especially across implicit conversion from __m512d (from the pd aka f64 shuffles) to mat4 to __m512 (for single-precision FMAs) and back. _mm512_shuffle_f32x4 is a drop-in replacement for _mm512_shuffle_f64x2; both use the shuffle-control immediate to select 128-bit lanes, and 32 vs. 64-bit element granularity for masking doesn't matter since you're not masking. It's more idiomatic to be using f32x4 shuffles on packed-float data, so generally prefer that anyway.
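As a quick illustration of that equivalence (an editor's sketch; unmasked, the two shuffles produce bit-identical results):

#include <immintrin.h>

// Both shuffles select whole 128-bit lanes by the same imm8, so without
// masking they give the same bits; f32x4 just avoids the __m512d round-trip.
__m512 broadcast_lane1(__m512 v) {
    __m512 a = _mm512_shuffle_f32x4(v, v, 85);          // lane 1 to all four lanes
    __m512 b = _mm512_castpd_ps(_mm512_shuffle_f64x2(   // same bits, via f64x2 + casts
        _mm512_castps_pd(v), _mm512_castps_pd(v), 85));
    (void)b;                                            // a and b are bit-identical
    return a;
}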

Writing the loop as below gets MSVC to make the asm you want. Using a __m512 variable required me to make the intrinsic types all match (if I didn't want to sprinkle _mm512_castps_pd and pd_ps around the shuffles); in fact, that's what first led to me noticing the __m512d vs. __m512 type difference.

    for (int i = 0; i < n; ++i) {
        __m512 zmm4 = bs[i];
        mat4 zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 85);
        zmm5 = _mm512_mul_ps(zmm5, zmm2);
        mat4 zmm6 = _mm512_shuffle_f32x4(zmm4, zmm4, 0);
        zmm6 = _mm512_fmadd_ps(zmm1, zmm6, zmm5);
        zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 170);
        zmm4 = _mm512_shuffle_f32x4(zmm4, zmm4, 255);
        zmm4 = _mm512_fmadd_ps(zmm0, zmm4, zmm6);
        zmm4 = _mm512_fmadd_ps(zmm3, zmm5, zmm4);
        bs[i] = zmm4;
    }

MSVC 19.32 (Godbolt, same as v19.latest) reloads your zmm0 constant from _zmm0$1$[esp+64] at the bottom of the loop, right before the vmovups [eax-64], zmm1 store into bs[i]. It seems to use ZMM3 as a temporary later in the loop, overwriting the constant. It also has a couple of wasted instructions like vmovups zmm1, zmm7.

But that only happens in a 32-bit build like the one you linked, not in a normal 64-bit build like https://godbolt.org/z/GWszEnfP5, where it doesn't spill any vector constants to the stack. (It does save/restore XMM6 and 7, though; IDK if Windows x64 made XMM16..31 all call-preserved like XMM6..15 are. You'd hope not; that's way too many call-preserved registers.) It still only used ZMM0..7, so it could have done the same in 32-bit code; it just failed.

GCC targeting 32-bit mode with -mabi=ms doesn't have those wasted zmm to zmm move instructions; it's able to arrange its FMAs to modify zmm4 (in ZMM0) in-place, scheduling the shuffles appropriately so the registers can be reused. (https://godbolt.org/z/9sGbYn71o)

Using the same vector type for all intrinsics also works for MSVC

Even with the reference, we get asm without extra store/reload of zmm4 on Godbolt with x86 MSVC v19.latest after just changing the shuffles to be f32x4.

    for (int i = 0; i < n; ++i) {
        mat4& __restrict zmm4 = bs[i];
        mat4 zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 85);
        zmm5 = _mm512_mul_ps(zmm5, zmm2);
        mat4 zmm6 = _mm512_shuffle_f32x4(zmm4, zmm4, 0);
        zmm6 = _mm512_fmadd_ps(zmm1, zmm6, zmm5);
        zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 170);
        zmm4 = _mm512_shuffle_f32x4(zmm4, zmm4, 255);
        zmm4 = _mm512_fmadd_ps(zmm0, zmm4, zmm6);
        zmm4 = _mm512_fmadd_ps(zmm3, zmm5, zmm4);
        //bs[i] = zmm4;
    }

I think it's more idiomatic to write code as loading a vector into a register, then processing, then storing back to memory. Especially with a name like zmm4, a reference variable seems odd; if you're thinking in terms of asm and registers, reference variables aren't a thing. A name like zmm4 doesn't imply any magic that will leave memory updated after modifying a register.

Using a non-reference means you're only modifying a local __m512 (or a local mat4, if you prefer), which is always easier for compilers to optimize into a register. (Although in your loop there aren't any other memory references it could alias with, even without __restrict.)


BTW, intrinsics let you use slightly meaningful names for vector variables, like vmat, mati, vbsi, or vchild, instead of zmm4. It's unlikely that the compiler will actually keep your C++ zmm4 variable in the ZMM4 register, so naming vars this way makes it more mental effort to compare the asm to the C++. For example, you get instructions like vmovups zmm3, ZMMWORD PTR _zmm0$1$[esp+64]

Using names like zmm0 is usually throwing away one of the clarity / readability advantages of intrinsics over assembly.
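For instance, here's the same loop with hypothetical descriptive names, where vcol0..vcol3 stand in for the zmm1/zmm2/zmm3/zmm0 broadcast constants (an illustrative sketch, same arithmetic):

    // Same computation with hypothetical names: vcol0..vcol3 are the per-row
    // broadcasts of parent columns 0..3 (zmm1, zmm2, zmm3, zmm0 above).
    for (size_t i = 0; i < n; ++i) {
        __m512 vchild = bs[i];
        __m512 acc = _mm512_mul_ps(_mm512_shuffle_f32x4(vchild, vchild, 85), vcol1);
        acc = _mm512_fmadd_ps(vcol0, _mm512_shuffle_f32x4(vchild, vchild, 0), acc);
        acc = _mm512_fmadd_ps(vcol3, _mm512_shuffle_f32x4(vchild, vchild, 255), acc);
        acc = _mm512_fmadd_ps(vcol2, _mm512_shuffle_f32x4(vchild, vchild, 170), acc);
        bs[i] = acc;
    }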

In fact, you'd prefer the compiler to use ZMM16..31 so it wouldn't need a vzeroupper when it's done. Except you linked a 32-bit build on Godbolt?? That's weird; there you only have ZMM0..7. You linked a 64-bit build for GCC.

answered Oct 23 '25 by Peter Cordes


By defining these

ALWAYS_INLINE operator __m512&() { return m512; }
ALWAYS_INLINE operator __m512d&() { return m512d; }
ALWAYS_INLINE operator const __m512&() const { return m512; }
ALWAYS_INLINE operator const __m512d&() const { return m512d; }

you technically break the premise of __restrict: the references returned at different places in the intrinsic calls on zmm4 all point to the same location, so they alias each other. It seems that MSVC correctly concludes that you are aliasing, and thus the compiler reloads the value from memory each time.

Note that your __restrict here applies to the zmm4 reference to a mat4 object, but not to the references returned by the conversion operators quoted above:

mat4& __restrict zmm4 = bs[i];

Not only are you aliasing, but you are also punning the type (though in a legal way, through a union).

The best solution would be to use the casting intrinsics (_mm512_castps_pd and _mm512_castpd_ps) and to store the temporary values in dedicated const variables. This way you should get the optimizations.
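A sketch of what that could look like (keeping the original f64x2 shuffles, with every conversion made explicit; an illustration, not a tested drop-in):

// Sketch: explicit casts plus dedicated const temporaries, so no reference
// to bs[i] survives into the arithmetic. zmm0..zmm3 are the broadcast
// constants from the question's loop.
for (size_t i = 0; i < n; ++i) {
    const __m512 vchild = bs[i];
    const __m512d vd = _mm512_castps_pd(vchild);
    const __m512 row0 = _mm512_castpd_ps(_mm512_shuffle_f64x2(vd, vd, 0));
    const __m512 row1 = _mm512_castpd_ps(_mm512_shuffle_f64x2(vd, vd, 85));
    const __m512 row2 = _mm512_castpd_ps(_mm512_shuffle_f64x2(vd, vd, 170));
    const __m512 row3 = _mm512_castpd_ps(_mm512_shuffle_f64x2(vd, vd, 255));
    __m512 acc = _mm512_mul_ps(row1, zmm2);
    acc = _mm512_fmadd_ps(zmm1, row0, acc);
    acc = _mm512_fmadd_ps(zmm0, row3, acc);
    acc = _mm512_fmadd_ps(zmm3, row2, acc);
    bs[i] = acc;
}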

answered Oct 23 '25 by Serge Rogatch


