
Using Vector<T> for SIMD in Universal Windows Platform

I'm trying to use System.Numerics.Vector<T> to vectorize an algorithm and take advantage of the CPU's SIMD operations. However, my vector implementation was substantially slower than my original implementation. Is there any trick to using Vector<T> that may not have been documented? The specific use here is to speed up XORs over kilobytes of data.
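
For context, a straightforward Vector<T> XOR kernel over byte buffers looks something like the following sketch (the XorKernel name and the scalar-tail handling are illustrative, not from my original code):

```csharp
using System;
using System.Numerics;

static class XorKernel
{
    // XOR src into dest in chunks of Vector<byte>.Count bytes,
    // then finish any remainder with a scalar loop.
    public static void Xor(byte[] dest, byte[] src)
    {
        int width = Vector<byte>.Count;
        int i = 0;
        for (; i <= dest.Length - width; i += width)
        {
            var v = new Vector<byte>(dest, i) ^ new Vector<byte>(src, i);
            v.CopyTo(dest, i);
        }
        // Scalar tail for lengths that aren't a multiple of the vector width.
        for (; i < dest.Length; i++)
            dest[i] ^= src[i];
    }
}
```

Whether this actually emits SIMD instructions depends on Vector.IsHardwareAccelerated, which is the crux of the question below.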

Unfortunately, almost all of the documentation I can find on it is based on a pre-release version of RyuJIT, and I don't know how much of that material carries over to .NET Native.

When I inspect the disassembly during a Vector xor operation, it shows:

00007FFB040A9C10  xor         eax,eax  
00007FFB040A9C12  mov         qword ptr [rcx],rax  
00007FFB040A9C15  mov         qword ptr [rcx+8],rax  
00007FFB040A9C19  mov         rax,qword ptr [r8]  
00007FFB040A9C1C  xor         rax,qword ptr [rdx]  
00007FFB040A9C1F  mov         qword ptr [rcx],rax  
00007FFB040A9C22  mov         rax,qword ptr [r8+8]  
00007FFB040A9C26  xor         rax,qword ptr [rdx+8]  
00007FFB040A9C2A  mov         qword ptr [rcx+8],rax  
00007FFB040A9C2E  mov         rax,rcx  

Why doesn't it use the xmm registers and SIMD instructions for this? What's also odd is that SIMD instructions were generated for a version of this code that I hadn't explicitly vectorized, but they were never being executed, in favor of the regular registers and instructions.

I ensured that I was running with Release, x64, and "Optimize code" enabled. I saw similar behavior with x86 compilation. I'm somewhat of a novice at machine-level details, so it's possible there's just something going on here that I'm not properly understanding.

The framework version is 4.6, and Vector.IsHardwareAccelerated is false at runtime.

Update: "Compile with .NET Native tool chain" is the culprit. Enabling it causes Vector.IsHardwareAccelerated == false; disabling it causes Vector.IsHardwareAccelerated == true. I've confirmed that when .NET Native is disabled, the compiler produces AVX instructions using the ymm registers. Which leads to the question: why is SIMD not enabled in .NET Native, and is there any way to change that?

Update Tangent: I discovered that the reason the auto-SSE-vectorized array code wasn't being executed was that the compiler had inserted an instruction checking whether the start of the array was at a lower address than one of its last elements, and if so, falling back to the normal registers. I think that must be a bug in the compiler, because the start of an array should always be at a lower address than its last elements. It was part of a set of instructions testing the memory addresses of each of the operand arrays, presumably to make sure they were non-overlapping. I've filed a Microsoft Connect bug report for this: https://connect.microsoft.com/VisualStudio/feedback/details/1831117

asked Sep 20 '15 by Nick Bauer



1 Answer

I contacted Microsoft through the contact address they posted for .NET Native questions and concerns: https://msdn.microsoft.com/en-us/vstudio/dotnetnative.aspx

My question was referred to Ian Bearman, Principal Software Engineering Manager in the Microsoft Code Generation and Optimization Technologies Team:

Currently .NET Native does not optimize the System.Numerics library and relies on the default library implementation. This may (read: will likely) result in code written using System.Numerics to not perform as well in .NET Native as it will against other CLR implementations.

While this is unfortunate, .NET Native does support auto-vectorization which comes with using the C++ optimizations mentioned above. The current shipping .NET Native compiler supports SSE2 ISA in its auto-vectorization on x86 and x64 and NEON ISA on ARM.

He also mentioned that they want to bring over from the C++ compiler the ability to generate all vector instructions (AVX, SSE, etc.) and branch based on detection of the instruction set at runtime.

He then suggested that if usage of instructions is really critical, the component can be built in C++, which has access to the compiler intrinsics (and presumably this branching ability?) and then easily interfaced to the remaining C# application.
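
That C++/C# split would typically go through P/Invoke. A minimal sketch of the C# side follows; the library name simdxor.dll and the export xor_buffers are hypothetical names of mine, not anything from the answer:

```csharp
using System;
using System.Runtime.InteropServices;

static class NativeXor
{
    // Hypothetical native export. The C++ side would implement this with
    // compiler intrinsics (e.g. _mm_xor_si128) or leave it to the C++
    // optimizer's auto-vectorization, per the suggestion above.
    [DllImport("simdxor.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void xor_buffers(byte[] dest, byte[] src, int length);
}
```

The marshaling cost of crossing the managed/native boundary means this only pays off when each call processes a reasonably large buffer.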

As for the skipped-over SSE2 instructions, all I needed to do to get the right instructions to compile was to replace a looped "a = a ^ b" with "a ^= b". Since those should be equivalent expressions, this appears to be a compiler bug, but fortunately one with a workaround.
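
Reconstructed as code, the two forms look like this (array names and the ulong element type are assumed; the answer only quotes the expressions):

```csharp
static class XorWorkaround
{
    // Looped "a = a ^ b" form, which .NET Native compiled to scalar
    // instructions despite also emitting (unreached) SSE2 code:
    public static void XorExplicit(ulong[] a, ulong[] b)
    {
        for (int i = 0; i < a.Length; i++)
            a[i] = a[i] ^ b[i];
    }

    // Equivalent "a ^= b" form, which compiled to the vectorized path:
    public static void XorCompound(ulong[] a, ulong[] b)
    {
        for (int i = 0; i < a.Length; i++)
            a[i] ^= b[i];
    }
}
```

Both methods compute identical results; only the code the .NET Native compiler generated for them differed.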

answered Nov 15 '22 by Nick Bauer