If you look at the .NET source code for System.Numerics.Matrix4x4, the multiply operator and other functions use an if/else chain to check which hardware intrinsics are supported:
if (AdvSimd.Arm64.IsSupported) {} else if (Sse.IsSupported) {}
But the generic System.Numerics.Vector<T> struct seems to do the same thing, so what is the difference? Doesn't Vector<T> simply look behind the scenes and use whichever instruction set is available, with a software fallback if none of them are?
The generic SIMD API in C#'s System.Numerics (Vector<T>) doesn't expose all the shuffles and other ISA-specific things like x86 movmskps. If you can get the job done efficiently with the common subset of functionality exposed by the generic API, I'd assume that would be a good choice and still compile to the instructions you'd expect.
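As a rough illustration of that "common subset" case, here is a minimal sketch of a purely element-wise loop written with Vector<T>. The class and method names (GenericSimd, MultiplyAdd) are made up for this example. It assumes all four spans have the same length and makes no claim about which instructions the JIT actually emits on a given machine.

```csharp
using System;
using System.Numerics;

public static class GenericSimd
{
    // Hypothetical helper: dst[i] = a[i] * b[i] + c[i], using only
    // "vertical" (lane-by-lane) operations that Vector<T> exposes.
    public static void MultiplyAdd(ReadOnlySpan<float> a, ReadOnlySpan<float> b,
                                   ReadOnlySpan<float> c, Span<float> dst)
    {
        if (a.Length != dst.Length || b.Length != dst.Length || c.Length != dst.Length)
            throw new ArgumentException("All spans must have the same length.");

        int width = Vector<float>.Count; // lane count depends on the hardware
        int i = 0;

        // Main loop: one full vector per iteration.
        for (; i <= dst.Length - width; i += width)
        {
            var va = new Vector<float>(a.Slice(i));
            var vb = new Vector<float>(b.Slice(i));
            var vc = new Vector<float>(c.Slice(i));
            (va * vb + vc).CopyTo(dst.Slice(i));
        }

        // Scalar tail for the leftover elements.
        for (; i < dst.Length; i++)
            dst[i] = a[i] * b[i] + c[i];
    }
}
```

Nothing here needs a shuffle or a broadcast from one lane of a vector, which is why the generic API is enough for this kind of loop.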
But the function you mentioned uses Sse.Shuffle (shufps) or AdvSimd.Arm64.FusedMultiplyAddBySelectedScalar (?) to broadcast and mul+add. If ARM64 can actually do that in a single instruction (scalar broadcast source for a vector multiply), that's pretty cool. The predecessor to AVX-512 (the KNC new instructions in early Xeon Phi) could do that, but even AVX-512 needs a shuffle and a separate FMA. (Unless the operand is coming from memory: AVX-512 can use a broadcast memory source operand.)
I don't see any shuffles at all in the docs you linked for System.Numerics, only pure vertical SIMD, so that's not very useful for a 4x4 matrix product where each row[i] needs to get multiplied by a broadcast(col[i]) vector.
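To make the broadcast-then-multiply-add pattern concrete, here is a hedged sketch of one row of a 4x4 product written against the hardware intrinsics, using the same if/else shape as the question. It is not the actual Matrix4x4 source: the class and method names (RowTimesMatrix, Multiply) are invented for illustration, and the final branch is a plain scalar fallback.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

public static class RowTimesMatrix
{
    // Hypothetical helper: returns a[0]*b0 + a[1]*b1 + a[2]*b2 + a[3]*b3,
    // i.e. one row of A times the 4x4 matrix whose rows are b0..b3.
    public static Vector128<float> Multiply(Vector128<float> a,
        Vector128<float> b0, Vector128<float> b1,
        Vector128<float> b2, Vector128<float> b3)
    {
        if (AdvSimd.Arm64.IsSupported)
        {
            // ARM64: the scalar lane of 'a' is selected by the instruction itself,
            // so no separate broadcast is needed.
            Vector128<float> acc = AdvSimd.MultiplyBySelectedScalar(b0, a, 0);
            acc = AdvSimd.Arm64.FusedMultiplyAddBySelectedScalar(acc, b1, a, 1);
            acc = AdvSimd.Arm64.FusedMultiplyAddBySelectedScalar(acc, b2, a, 2);
            acc = AdvSimd.Arm64.FusedMultiplyAddBySelectedScalar(acc, b3, a, 3);
            return acc;
        }
        else if (Sse.IsSupported)
        {
            // x86: shufps with both sources equal and immediates
            // 0x00 / 0x55 / 0xAA / 0xFF broadcasts lane 0 / 1 / 2 / 3.
            Vector128<float> acc = Sse.Multiply(Sse.Shuffle(a, a, 0x00), b0);
            acc = Sse.Add(acc, Sse.Multiply(Sse.Shuffle(a, a, 0x55), b1));
            acc = Sse.Add(acc, Sse.Multiply(Sse.Shuffle(a, a, 0xAA), b2));
            acc = Sse.Add(acc, Sse.Multiply(Sse.Shuffle(a, a, 0xFF), b3));
            return acc;
        }
        else
        {
            // Software fallback: plain scalar arithmetic.
            Vector128<float>[] rows = { b0, b1, b2, b3 };
            float[] r = new float[4];
            for (int k = 0; k < 4; k++)
                for (int j = 0; j < 4; j++)
                    r[j] += a.GetElement(k) * rows[k].GetElement(j);
            return Vector128.Create(r[0], r[1], r[2], r[3]);
        }
    }
}
```

Calling this once per row of the left-hand matrix gives the full product. Note that the ARM64 path uses fused multiply-add while the SSE path rounds after each multiply, so results can differ in the last bit for inputs that are not exactly representable.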
So System.Numerics looks way more crippled than GNU C native vectors in C and C++, where there is at least a __builtin_shuffle. Even those still miss out on special shuffles, and on stuff like x86 movmskps to get a scalar bitmap of SIMD compare results. (Which ARM and ARM64 have no direct equivalent for.)
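For reference, this is roughly what the movmskps case looks like through the x86 intrinsics, with a scalar loop standing in on hardware that lacks it. Treat it as a sketch: the class and method names (CompareMask, LessThan) are hypothetical, and the fallback is just the simplest thing that produces the same bitmap, not a tuned ARM sequence.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class CompareMask
{
    // Hypothetical helper: bit i of the result is set when a[i] < b[i].
    public static int LessThan(Vector128<float> a, Vector128<float> b)
    {
        if (Sse.IsSupported)
        {
            // cmpltps produces all-ones lanes where the compare is true;
            // movmskps then packs the sign bit of each lane into an int.
            return Sse.MoveMask(Sse.CompareLessThan(a, b));
        }

        // Fallback: build the same bitmap one lane at a time.
        int mask = 0;
        for (int i = 0; i < 4; i++)
        {
            if (a.GetElement(i) < b.GetElement(i))
                mask |= 1 << i;
        }
        return mask;
    }
}
```

For a = (1, 5, 3, 7) and b = (2, 4, 4, 6), lanes 0 and 2 compare true, so the mask is 0b0101.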