
Using Vector<T> for SIMD in Universal Windows Platform

I'm trying to use System.Numerics.Vector<T> to vectorize an algorithm and take advantage of the CPU's SIMD operations. However, my vector implementation was substantially slower than my original implementation. Is there any trick to using Vector<T> that may not have been documented? The specific use here is to speed up XORs over kilobytes of data.
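
For context, a straightforward Vector<T> XOR kernel over byte buffers looks something like the following sketch (the XorKernel name and the scalar-tail handling are illustrative, not from my original code):

```csharp
using System;
using System.Numerics;

static class XorKernel
{
    // XOR src into dest in chunks of Vector<byte>.Count bytes,
    // then finish any remainder with a scalar loop.
    public static void Xor(byte[] dest, byte[] src)
    {
        int width = Vector<byte>.Count;
        int i = 0;
        for (; i <= dest.Length - width; i += width)
        {
            var v = new Vector<byte>(dest, i) ^ new Vector<byte>(src, i);
            v.CopyTo(dest, i);
        }
        // Scalar tail for lengths that aren't a multiple of the vector width.
        for (; i < dest.Length; i++)
            dest[i] ^= src[i];
    }
}
```

Whether this actually emits SIMD instructions depends on Vector.IsHardwareAccelerated, which is the crux of the question below.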

Unfortunately, almost all of the documentation I can find on it is based on a pre-release version of RyuJIT, and I don't know how much of that material carries over to .NET Native.

When I inspect the disassembly during a Vector xor operation, it shows:

00007FFB040A9C10  xor         eax,eax  
00007FFB040A9C12  mov         qword ptr [rcx],rax  
00007FFB040A9C15  mov         qword ptr [rcx+8],rax  
00007FFB040A9C19  mov         rax,qword ptr [r8]  
00007FFB040A9C1C  xor         rax,qword ptr [rdx]  
00007FFB040A9C1F  mov         qword ptr [rcx],rax  
00007FFB040A9C22  mov         rax,qword ptr [r8+8]  
00007FFB040A9C26  xor         rax,qword ptr [rdx+8]  
00007FFB040A9C2A  mov         qword ptr [rcx+8],rax  
00007FFB040A9C2E  mov         rax,rcx  

Why doesn't it use the xmm registers and SIMD instructions for this? What's also odd is that SIMD instructions were generated for a version of this code that I hadn't explicitly vectorized, but they were never being executed, in favor of the regular registers and instructions.

I ensured that I was running with Release, x64, and "Optimize code" enabled. I saw similar behavior with x86 compilation. I'm somewhat of a novice at machine-level details, so it's possible there's just something going on here that I'm not properly understanding.

The framework version is 4.6, and Vector.IsHardwareAccelerated is false at runtime.

Update: "Compile with .NET Native tool chain" is the culprit. Enabling it causes Vector.IsHardwareAccelerated == false; disabling it causes Vector.IsHardwareAccelerated == true. I've confirmed that when .NET Native is disabled, the compiler produces AVX instructions using the ymm registers. Which leads to the question: why is SIMD not enabled in .NET Native, and is there any way to change that?

Update Tangent: I discovered that the reason the auto-SSE-vectorized array code wasn't being executed was that the compiler had inserted an instruction checking whether the start of the array was at a lower address than one of its last elements, and if so, falling back to the normal registers. I think that must be a bug in the compiler, because the start of an array should always be at a lower address than its last elements. It was part of a set of instructions testing the memory addresses of each of the operand arrays, presumably to make sure they were non-overlapping. I've filed a Microsoft Connect bug report for this: https://connect.microsoft.com/VisualStudio/feedback/details/1831117

asked Sep 20 '15 by Nick Bauer



1 Answer

I contacted Microsoft through the contact address they posted for .NET Native questions and concerns: https://msdn.microsoft.com/en-us/vstudio/dotnetnative.aspx

My question was referred to Ian Bearman, Principal Software Engineering Manager in the Microsoft Code Generation and Optimization Technologies Team:

Currently .NET Native does not optimize the System.Numerics library and relies on the default library implementation. This may (read: will likely) result in code written using System.Numerics to not perform as well in .NET Native as it will against other CLR implementations.

While this is unfortunate, .NET Native does support auto-vectorization which comes with using the C++ optimizations mentioned above. The current shipping .NET Native compiler supports SSE2 ISA in its auto-vectorization on x86 and x64 and NEON ISA on ARM.

He also mentioned that they want to bring over from the C++ compiler the ability to generate all vector instructions (AVX, SSE, etc.) and branch based on detection of the instruction set at runtime.

He then suggested that if usage of instructions is really critical, the component can be built in C++, which has access to the compiler intrinsics (and presumably this branching ability?) and then easily interfaced to the remaining C# application.
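
That C++/C# split would typically go through P/Invoke. A minimal sketch of the C# side follows; the library name simdxor.dll and the export xor_buffers are hypothetical names of mine, not anything from the answer:

```csharp
using System;
using System.Runtime.InteropServices;

static class NativeXor
{
    // Hypothetical native export. The C++ side would implement this with
    // compiler intrinsics (e.g. _mm_xor_si128) or leave it to the C++
    // optimizer's auto-vectorization, per the suggestion above.
    [DllImport("simdxor.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void xor_buffers(byte[] dest, byte[] src, int length);
}
```

The marshaling cost of crossing the managed/native boundary means this only pays off when each call processes a reasonably large buffer.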

As for the skipped-over SSE2 instructions, all I needed to do to get the right instructions to compile was to replace a looped "a = a ^ b" with "a ^= b". Since those should be equivalent expressions, this appears to be a compiler bug, but fortunately one with a workaround.
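
Reconstructed as code, the two forms look like this (array names and the ulong element type are assumed; the answer only quotes the expressions):

```csharp
static class XorWorkaround
{
    // Looped "a = a ^ b" form, which .NET Native compiled to scalar
    // instructions despite also emitting (unreached) SSE2 code:
    public static void XorExplicit(ulong[] a, ulong[] b)
    {
        for (int i = 0; i < a.Length; i++)
            a[i] = a[i] ^ b[i];
    }

    // Equivalent "a ^= b" form, which compiled to the vectorized path:
    public static void XorCompound(ulong[] a, ulong[] b)
    {
        for (int i = 0; i < a.Length; i++)
            a[i] ^= b[i];
    }
}
```

Both methods compute identical results; only the code the .NET Native compiler generated for them differed.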

answered Nov 15 '22 by Nick Bauer