My CPU is an AMD Ryzen 7 7840H, which supports the AVX-512 instruction set. When I run a .NET 8 program, Vector512.IsHardwareAccelerated is true, but System.Numerics.Vector<T> is still 256-bit; it never reaches 512 bits.
Why doesn't the Vector<T> type reach 512 bits in length? Is this currently unsupported, or do I need to tweak some configuration?
Example code:
using System;
using System.IO;
using System.Numerics;           // Vector, Vector<T>
using System.Runtime.Intrinsics; // Vector512

TextWriter writer = Console.Out;
writer.WriteLine(string.Format("Vector512.IsHardwareAccelerated:\t{0}", Vector512.IsHardwareAccelerated));
writer.WriteLine(string.Format("Vector.IsHardwareAccelerated:\t{0}", Vector.IsHardwareAccelerated));
writer.WriteLine(string.Format("Vector<byte>.Count:\t{0}\t# {1}bit", Vector<byte>.Count, Vector<byte>.Count * 8));
Test results:
Vector512.IsHardwareAccelerated: True
Vector.IsHardwareAccelerated: True
Vector<byte>.Count: 32 # 256bit
See https://github.com/dotnet/runtime/issues/92189 - For the same hardware reason that C compilers default to -mprefer-vector-width=256 when auto-vectorizing large loops, C# doesn't automatically make all vectorized code use 512-bit vectors even when they're available.
Also, for small problems, e.g. 9 floats, a wider Vector<T> could mean no vectorized iterations happen at all, just the scalar fallback code.
Also, apparently some codebases (hopefully accidentally) depend on Vector<T> being no wider than 32 bytes, so widening it would be a breaking change for them.
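None of this stops you from opting into 512-bit vectors explicitly today: the fixed-width Vector512<T> API works in .NET 8 regardless of what Vector<T> sizes to. A minimal sketch (using only the standard System.Runtime.Intrinsics surface) that sums a float array 16 lanes at a time, with a scalar tail:

```csharp
using System;
using System.Runtime.Intrinsics;

float[] data = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
Console.WriteLine(Sum(data)); // prints 45

static float Sum(ReadOnlySpan<float> values)
{
    float total = 0f;
    int i = 0;
    if (Vector512.IsHardwareAccelerated)
    {
        var acc = Vector512<float>.Zero;
        // 16 floats (512 bits) per iteration; this loop runs zero times
        // when values.Length < 16, e.g. the 9-float case above.
        for (; i <= values.Length - Vector512<float>.Count; i += Vector512<float>.Count)
            acc += Vector512.Create(values.Slice(i, Vector512<float>.Count));
        total = Vector512.Sum(acc);
    }
    // Scalar tail (and the whole computation when not accelerated).
    for (; i < values.Length; i++)
        total += values[i];
    return total;
}
```

Note that with only 9 floats, the 512-bit loop body never executes and everything falls through to the scalar tail, which is exactly the small-problem concern above.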
@stephentoub wrote: "In .NET 8, the variable-width Vector<T> will not automatically support widths greater than 256 bits. It's likely in .NET 9 you'll be able to opt-in to that, but at present it's not clear whether it'll be enabled by default."
I commented on the dotnet github issue with some details about the CPU-hardware reasons; I'll reproduce some of that here:
One data point is -mprefer-vector-width=512 vs. 256 on Ice Lake Xeon. But again, that's LLVM auto-vectorization of scalar code, unlike C#, where the Vector<T> width would only affect manually-vectorized loops, so the tuning considerations are somewhat different from -mprefer-vector-width=256. In a program that frequently wakes up for short bursts of computation, its AVX-512 usage will still lower turbo frequency for the core, affecting other programs.
Things are different on Zen 4: it handles 512-bit vectors by taking extra cycles in the execution units, so as long as 512-bit vectors don't require more shuffling work or some other effect that adds overhead, they're a good win for front-end throughput and for how far ahead out-of-order exec can see in terms of elements or scalar iterations (a 512-bit uop is still only one uop for the front-end). GCC and Clang default to -mprefer-vector-width=512 for -march=znver4.
There's no turbo penalty or other inherent downsides to 512-bit vectors on Zen 4 (AFAIK; I don't know how misaligned loads perform). It's just a matter of whether software can use them efficiently (without needing more bloated code for loop prologues / epilogues, e.g. scalar cleanup if a masked final iteration doesn't Just Work.) AVX-512 masked stores are efficient on Zen 4, despite the fact that AVX1/2 vmaskmovps / vpmaskmovd aren't. (https://uops.info/)
For code where you have exactly 32 bytes of something, losing the option of 32-byte vectors would be a real loss; C#'s scalable vector-length model isn't ideal for those cases. ARM SVE and the RISC-V Vector extension are hardware ISAs designed around a variable vector length, with masking to handle data shorter than the HW's native length, but doing the same thing for C# Vector<T> probably wouldn't work well, because lots of hardware (x86 with AVX2, or AArch64 without SVE) can't efficiently support masking for arbitrary-length data.
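For that exactly-one-vector-wide case, the fixed-width intrinsics types sidestep the problem: code that depends on a 32-byte shape can name Vector256 directly instead of relying on Vector<T> happening to be 256-bit. A small sketch XORing two 32-byte blocks in a single 256-bit operation:

```csharp
using System;
using System.Runtime.Intrinsics;

byte[] a = new byte[32];
byte[] b = new byte[32];
for (int i = 0; i < 32; i++) { a[i] = (byte)i; b[i] = 0xFF; }

// Exactly 32 bytes: one Vector256 op, independent of whatever
// width Vector<T> has on the current machine.
Vector256<byte> va = Vector256.Create<byte>(a);
Vector256<byte> vb = Vector256.Create<byte>(b);
Vector256<byte> vx = va ^ vb;

byte[] result = new byte[32];
vx.CopyTo(result);
Console.WriteLine(result[0]); // 0 ^ 0xFF = 255
```

The JIT lowers this to AVX2 instructions where available, and to a pair of 128-bit ops on hardware without 256-bit vectors, so the fixed shape is preserved either way.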
I wrote more about Intel on the github issue, which I'm not going to copy/paste all of here.
There can be significant overall throughput gains from 512-bit vectors for some workloads on Intel CPUs. But it comes with downsides, like more expensive misaligned memory access.