I have a type mat4 which represents a float[4][4]. Internally it uses a 512-bit register.
union alignas(16 * sizeof(float)) mat4 {
private:
    __m512 m512;
    __m512d m512d;

    ALWAYS_INLINE mat4(__m512 m512) : m512{m512} {}
    ALWAYS_INLINE mat4(__m512d m512d) : m512d{m512d} {}

    ALWAYS_INLINE operator __m512&() { return m512; }
    ALWAYS_INLINE operator __m512d&() { return m512d; }
    ALWAYS_INLINE operator const __m512&() const { return m512; }
    ALWAYS_INLINE operator const __m512d&() const { return m512d; }

    ALWAYS_INLINE mat4& operator=(__m512 _m512) {
        m512 = _m512;
        return *this;
    }
    ALWAYS_INLINE mat4& operator=(__m512d _m512d) {
        m512d = _m512d;
        return *this;
    }

public:
    friend void __vectorcall transform_children(mat4 parent, std::span<mat4> children);
};
I also have a function transform_children(mat4 parent, std::span<mat4> children). It treats all mat4s as transformation matrices and transforms all the children (in place) by multiplying them with the parent. I wrote¹ an optimised implementation using AVX512F intrinsics.
void __vectorcall transform_children(mat4 parent, std::span<mat4> children) {
    mat4* const __restrict bs = children.data();
    const size_t n = children.size();
    ASSUME(n != 0);

    const mat4 zmm1 = _mm512_permute_ps(parent, 0);
    const mat4 zmm2 = _mm512_permute_ps(parent, 85);
    const mat4 zmm3 = _mm512_permute_ps(parent, 170);
    const mat4 zmm0 = _mm512_permute_ps(parent, 255);

    for (int i = 0; i < n; ++i) {
        mat4& __restrict zmm4 = bs[i];
        mat4 zmm5 = _mm512_shuffle_f64x2(zmm4, zmm4, 85);
        zmm5 = _mm512_mul_ps(zmm5, zmm2);
        mat4 zmm6 = _mm512_shuffle_f64x2(zmm4, zmm4, 0);
        zmm6 = _mm512_fmadd_ps(zmm1, zmm6, zmm5);
        zmm5 = _mm512_shuffle_f64x2(zmm4, zmm4, 170);
        zmm4 = _mm512_shuffle_f64x2(zmm4, zmm4, 255);
        zmm4 = _mm512_fmadd_ps(zmm0, zmm4, zmm6);
        zmm4 = _mm512_fmadd_ps(zmm3, zmm5, zmm4);
    }
}
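For reference, the operation the intrinsics implement is an ordinary 4x4 matrix product: each child is replaced by parent * child. Here is a scalar sketch (assuming row-major storage; an illustration, not the exact nested-loop version mentioned in footnote 1):
#include <cstddef>
#include <cstring>
// Scalar sketch: result = parent * child for every child, row-major float[4][4].
void transform_children_scalar(const float parent[4][4], float (*children)[4][4], std::size_t n) {
    for (std::size_t c = 0; c < n; ++c) {
        float result[4][4] = {};
        for (int i = 0; i < 4; ++i)          // row of parent / result
            for (int k = 0; k < 4; ++k)      // column of parent, row of child
                for (int j = 0; j < 4; ++j)  // column of child / result
                    result[i][j] += parent[i][k] * children[c][k][j];
        std::memcpy(children[c], result, sizeof(result));
    }
}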
Both GCC and Clang compile the intrinsics version nearly literally to optimised assembly. Unfortunately, MSVC does something weird. For some reason, instead of loading the value of bs[i] into a register and then storing it back to the array at the end of the iteration, it accesses the memory 4 times:
void transform_children(mat4,std::span<mat4,4294967295>) PROC ; transform_children, COMDAT
mov ecx, DWORD PTR _children$[esp]
vpermilps zmm4, zmm0, 0
vpermilps zmm5, zmm0, 85
vpermilps zmm6, zmm0, 170
vpermilps zmm7, zmm0, 255
test ecx, ecx
je SHORT $LN36@transform_
mov eax, DWORD PTR _children$[esp-4]
npad 8
$LL4@transform_:
lea eax, DWORD PTR [eax+64]
vmovupd zmm3, ZMMWORD PTR [eax-64] ; HERE
vshuff64x2 zmm0, zmm3, zmm3, 85
vmulps zmm0, zmm0, zmm5
vshuff64x2 zmm1, zmm3, zmm3, 0
vmovups zmm2, zmm4
vfmadd213ps zmm2, zmm1, zmm0
vshuff64x2 zmm0, zmm3, zmm3, 255
vmovupd ZMMWORD PTR [eax-64], zmm0 ; HERE
vfmadd231ps zmm2, zmm7, ZMMWORD PTR [eax-64] ; HERE
vshuff64x2 zmm1, zmm3, zmm3, 170
vmovups zmm0, zmm6
vfmadd213ps zmm0, zmm1, zmm2
vmovups ZMMWORD PTR [eax-64], zmm0 ; HERE
sub ecx, 1
jne SHORT $LL4@transform_
$LN36@transform_:
vzeroupper
ret 8
void transform_children(mat4,std::span<mat4,4294967295>) ENDP ; transform_children
What could I do to make MSVC access memory only twice, like GCC and Clang² do?
1. To be precise, GCC and Clang wrote this implementation (sort of). First, I wrote the typical implementation using two nested loops. Then I ran it through GCC with -mavx512f. GCC was smart enough to generate optimised vectorised code. Then I converted this vectorised code from assembly back to C++ using intrinsics. Then I compiled the new intrinsic code with Clang, and it generated even faster vectorised assembly. Then I converted Clang's assembly back to C++ intrinsics again.
2. Clang accesses memory 4 times, but it unrolls the loop, so that's still two accesses per iteration.
TL:DR: it turns out that MSVC does a bad job when it has to convert between __m512d and __m512 through the overloaded conversions of your mat4 class. So just do everything with __m512 intrinsics, including the shuffling of 128-bit lanes.
MSVC making worse code is unfortunate but not shocking; MSVC's optimizer is well known to be generally not as good as GCC's or Clang's. MSVC doesn't do strict-aliasing, although __m512 can alias anything anyway, so IDK if that's relevant here.
Seems like you should just use a __m512 (or maybe mat4) temporary variable, instead of telling the compiler to modify bs[i] repeatedly and hoping it actually doesn't.
Especially across implicit conversion from __m512d (from the pd aka f64 shuffles) to mat4 to __m512 (for single-precision FMAs) and back. _mm512_shuffle_f32x4 is a drop-in replacement for _mm512_shuffle_f64x2; both use the shuffle-control immediate to select 128-bit lanes, and 32 vs. 64-bit element granularity for masking doesn't matter since you're not masking. It's more idiomatic to be using f32x4 shuffles on packed-float data, so generally prefer that anyway.
Writing it like this gets MSVC to make the asm you want; using a __m512 variable required me to make the intrinsic types all match (if I didn't want to sprinkle it with _mm512_castps_pd and _mm512_castpd_ps around the shuffles; a cast-sprinkled variant is sketched after the code below). In fact that's what first led me to notice the __m512d vs. __m512 type difference.
for (int i = 0; i < n; ++i) {
    __m512 zmm4 = bs[i];
    mat4 zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 85);
    zmm5 = _mm512_mul_ps(zmm5, zmm2);
    mat4 zmm6 = _mm512_shuffle_f32x4(zmm4, zmm4, 0);
    zmm6 = _mm512_fmadd_ps(zmm1, zmm6, zmm5);
    zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 170);
    zmm4 = _mm512_shuffle_f32x4(zmm4, zmm4, 255);
    zmm4 = _mm512_fmadd_ps(zmm0, zmm4, zmm6);
    zmm4 = _mm512_fmadd_ps(zmm3, zmm5, zmm4);
    bs[i] = zmm4;
}
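For comparison, keeping one of the f64x2 shuffles while working with a plain __m512 variable would look roughly like this (my illustration, not code from the question):
// Illustration only: a single shuffle kept as f64x2, with explicit casts so the
// surrounding code stays in __m512. zmm4 here is the __m512 loaded from bs[i].
__m512 zmm5 = _mm512_castpd_ps(
    _mm512_shuffle_f64x2(_mm512_castps_pd(zmm4), _mm512_castps_pd(zmm4), 85));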
MSVC 19.32 (Godbolt, same as v19.latest) is reloading your zmm0 constant from _zmm0$1$[esp+64] at the bottom of the loop, right before the vmovups [eax-64], zmm1 store into bs[i]. It seems to use ZMM3 as a temporary later in the loop, overwriting the constant. It also has a couple of wasted instructions like vmovups zmm1, zmm7.
But that only happens in a 32-bit build like the one you linked, not a normal 64-bit build like https://godbolt.org/z/GWszEnfP5 where it doesn't spill any vector constants to the stack. (It does save/restore XMM6 and 7, though; IDK if Windows x64 made XMM16..31 all call-preserved like XMM6..15 are. You'd hope not; that's way too many call-preserved registers.) It still only used ZMM0..7, so it could have done the same in 32-bit code; it just failed to.
GCC targeting 32-bit mode with -mabi=ms doesn't have those wasted zmm to zmm move instructions; it's able to arrange its FMAs to modify zmm4 (in ZMM0) in-place, scheduling the shuffles appropriately so the registers can be reused. (https://godbolt.org/z/9sGbYn71o)
Even with the reference, we get asm without extra store/reload of zmm4 on Godbolt with x86 MSVC v19.latest after just changing the shuffles to be f32x4.
for (int i = 0; i < n; ++i) {
    mat4& __restrict zmm4 = bs[i];
    mat4 zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 85);
    zmm5 = _mm512_mul_ps(zmm5, zmm2);
    mat4 zmm6 = _mm512_shuffle_f32x4(zmm4, zmm4, 0);
    zmm6 = _mm512_fmadd_ps(zmm1, zmm6, zmm5);
    zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 170);
    zmm4 = _mm512_shuffle_f32x4(zmm4, zmm4, 255);
    zmm4 = _mm512_fmadd_ps(zmm0, zmm4, zmm6);
    zmm4 = _mm512_fmadd_ps(zmm3, zmm5, zmm4);
    //bs[i] = zmm4;
}
I think it's more idiomatic to write code as loading a vector into a register, then processing, then storing back to memory. Especially with a name like zmm4, that seems odd for a reference variable; if you're thinking in terms of asm and registers, reference variables aren't a thing. A name like zmm4 doesn't imply any magic that will leave memory updated after modifying a register.
Using a non-reference means you're only modifying a local __m512 (or mat4 if you want to use a non-reference mat4), which is always easier for compilers to optimize into a register. (Although in your loop there aren't any other memory references that it could alias with, even without __restrict.)
BTW, intrinsics let you use slightly meaningful names for vector variables, like vmat, mati, vbsi, or vchild, not zmm4. It's unlikely that the compiler will actually keep your C++ zmm4 variable in the ZMM4 register, so it's more mental effort to compare the asm to the C++ when naming vars this way; e.g. you get instructions like vmovups zmm3, ZMMWORD PTR _zmm0$1$[esp+64].
Using names like zmm0 is usually throwing away one of the clarity / readability advantages of intrinsics over assembly.
In fact you'd prefer the compiler to use ZMM16..31 so it wouldn't need a vzeroupper when it's done. Except you linked a 32-bit build on Godbolt?? That's weird, so you only have ZMM0..7. You linked a 64-bit build for GCC.
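Coming back to naming: here's a sketch of the non-reference f32x4 version above with names I'd pick (my renaming only; same math and FMA accumulation order as above, with all-__m512 types and the shuffles grouped at the top):
// Same function as the fixed version above, only renamed for readability.
// pcolK holds parent[i][K] broadcast across each 128-bit lane (row) i;
// crowK holds child row K replicated into all four lanes.
void __vectorcall transform_children(mat4 parent, std::span<mat4> children) {
    mat4* const __restrict bs = children.data();
    const size_t n = children.size();
    ASSUME(n != 0);

    const __m512 pcol0 = _mm512_permute_ps(parent, 0x00);
    const __m512 pcol1 = _mm512_permute_ps(parent, 0x55);
    const __m512 pcol2 = _mm512_permute_ps(parent, 0xAA);
    const __m512 pcol3 = _mm512_permute_ps(parent, 0xFF);

    for (size_t i = 0; i < n; ++i) {
        const __m512 vchild = bs[i];                                      // load once
        const __m512 crow0 = _mm512_shuffle_f32x4(vchild, vchild, 0x00);
        const __m512 crow1 = _mm512_shuffle_f32x4(vchild, vchild, 0x55);
        const __m512 crow2 = _mm512_shuffle_f32x4(vchild, vchild, 0xAA);
        const __m512 crow3 = _mm512_shuffle_f32x4(vchild, vchild, 0xFF);
        __m512 acc = _mm512_mul_ps(crow1, pcol1);
        acc = _mm512_fmadd_ps(pcol0, crow0, acc);
        acc = _mm512_fmadd_ps(pcol3, crow3, acc);
        acc = _mm512_fmadd_ps(pcol2, crow2, acc);
        bs[i] = acc;                                                      // store once
    }
}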
By defining these
ALWAYS_INLINE operator __m512&() { return m512; }
ALWAYS_INLINE operator __m512d&() { return m512d; }
ALWAYS_INLINE operator const __m512&() const { return m512; }
ALWAYS_INLINE operator const __m512d&() const { return m512d; }
you technically undermine the basis for __restrict: the references returned at the different places where the intrinsics use zmm4 all refer to the same object, so you are aliasing. It seems that MSVC correctly concludes that you may be aliasing, and therefore reloads the value from memory each time.
Please note that your __restrict here applies to the reference to the mat4 object itself, not to the references returned by the conversion operators quoted above:
mat4& __restrict zmm4 = bs[i];
Not only are you aliasing, but you are also punning the type (though in a legal way - through a union).
The best solution should be to use the casting intrinsics (_mm512_castps_pd / _mm512_castpd_ps) and to store the temporary values in dedicated const local variables. That way you should get the optimizations.
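A sketch of what that could look like, reusing the bs, n, and zmm0..zmm3 variables from the question's transform_children (my illustration of the suggestion, not tested against MSVC):
// Sketch: keep the f64x2 shuffles but go through explicit casts, use const
// local temporaries, and do exactly one load and one store per child.
for (size_t i = 0; i < n; ++i) {
    const __m512 child = bs[i];                      // one load
    const __m512d childd = _mm512_castps_pd(child);  // reinterpretation only, no instruction
    const __m512 crow0 = _mm512_castpd_ps(_mm512_shuffle_f64x2(childd, childd, 0));
    const __m512 crow1 = _mm512_castpd_ps(_mm512_shuffle_f64x2(childd, childd, 85));
    const __m512 crow2 = _mm512_castpd_ps(_mm512_shuffle_f64x2(childd, childd, 170));
    const __m512 crow3 = _mm512_castpd_ps(_mm512_shuffle_f64x2(childd, childd, 255));
    __m512 acc = _mm512_mul_ps(crow1, zmm2);
    acc = _mm512_fmadd_ps(zmm1, crow0, acc);
    acc = _mm512_fmadd_ps(zmm0, crow3, acc);
    acc = _mm512_fmadd_ps(zmm3, crow2, acc);
    bs[i] = acc;                                     // one store
}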