I'm playing around with .NET Core 3.0's new support for hardware intrinsics, in the System.Runtime.Intrinsics namespace.
I have some code where I perform 4 XOR operations in a loop - below is a simplified example (I didn't write this in an IDE, so please ignore any syntax mistakes):
private static unsafe ulong WyHashCore(byte[] array)
{
    fixed (byte* pData = array)
    {
        byte* ptr = pData;

        // Consume 32-byte chunks
        for (int i = 0; i < array.Length; i += 32)
        {
            ulong a = Read64(ptr, i);
            ulong b = Read64(ptr, i + 8);
            ulong c = Read64(ptr, i + 16);
            ulong d = Read64(ptr, i + 24);

            // XOR them with some constants
            ulong xor1 = a ^ SOME_CONSTANT1;
            ulong xor2 = b ^ SOME_CONSTANT2;
            ulong xor3 = c ^ SOME_CONSTANT3;
            ulong xor4 = d ^ SOME_CONSTANT4;

            // Use the resulting values
        }
    }
}
The Read64 method looks like:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static unsafe ulong Read64(byte* ptr, int start)
    => *(ulong*)(ptr + start);
I tried replacing the 4 XOR lines with:
byte[] array; // An array from elsewhere

private static unsafe ulong WyHashCore(byte[] array)
{
    var bVector = Vector256.Create(SOME_CONSTANT1, SOME_CONSTANT2, SOME_CONSTANT3, SOME_CONSTANT4);

    fixed (byte* pData = array)
    {
        byte* ptr = pData;

        // Consume 32-byte chunks
        for (int i = 0; i < array.Length; i += 32)
        {
            ulong a = Read64(ptr, i);
            ulong b = Read64(ptr, i + 8);
            ulong c = Read64(ptr, i + 16);
            ulong d = Read64(ptr, i + 24);

            // Create a 256-bit vector from the 4 64-bit integers
            var aVector = Vector256.Create(a, b, c, d);

            // XOR the 2 vectors
            var res = Avx2.Xor(aVector, bVector);

            // Get the resulting values out of the result vector
            ulong xor1 = res.GetElement(0);
            ulong xor2 = res.GetElement(1);
            ulong xor3 = res.GetElement(2);
            ulong xor4 = res.GetElement(3);

            // Use the resulting values
        }
    }
}
This does give the expected results - but it's 15x slower than just doing the four scalar XORs!
Am I going wrong somewhere, or mis-using SIMD?
Update: I've updated the code to use the 'correct' ways of loading and offloading data to/from the vector, and it's now around 3.75x faster than the scalar code!
byte[] array; // An array from elsewhere

private static readonly Vector256<ulong> PrimeVector = Vector256.Create(SOME_CONSTANT1, SOME_CONSTANT2, SOME_CONSTANT3, SOME_CONSTANT4);

private static unsafe ulong WyHashCore(byte[] array)
{
    // Create space on the stack to hold XOR results
    var xorResult = stackalloc ulong[4];

    fixed (byte* pData = array)
    {
        byte* ptr = pData;

        // Consume 32-byte chunks
        for (int i = 0; i < array.Length; i += 32)
        {
            // Create a 256-bit vector from the 4 64-bit integers
            var vector = Avx.LoadVector256((ulong*)(ptr + i));

            // XOR the 2 vectors
            var res = Avx2.Xor(vector, PrimeVector);

            // Store the resulting vector in memory
            Avx2.Store(xorResult, res);

            // Get the resulting values out of the result vector
            var xor1 = *xorResult;
            var xor2 = *(xorResult + 1);
            var xor3 = *(xorResult + 2);
            var xor4 = *(xorResult + 3);

            // Use the resulting values
        }
    }
}
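For reference, the two versions can be compared with a minimal BenchmarkDotNet-style harness like the sketch below. It is only a sketch: WyHashCoreScalar and WyHashCoreVector are placeholder names for the original and updated methods above, and the input size is arbitrary.

using BenchmarkDotNet.Attributes;

public class WyHashCoreBenchmark
{
    private byte[] _data;

    [GlobalSetup]
    public void Setup() => _data = new byte[1 << 20]; // 1 MiB, a multiple of 32 bytes

    [Benchmark(Baseline = true)]
    public ulong Scalar() => WyHashCoreScalar(_data); // the original 4-XOR loop

    [Benchmark]
    public ulong Vector() => WyHashCoreVector(_data); // the Avx.LoadVector256 / Avx2.Xor version
}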
TL;DR The AVX2 hardware intrinsics are used incorrectly, which results in the generation of very inefficient SIMD code.
The error lies in the way the data is loaded, operated on, and stored. The XOR should be performed with the AVX/AVX2 Avx2.Xor intrinsic on a vector loaded directly from memory, which speeds up the load roughly 4x and yields a Vector256. This also makes the call to Vector256.Create redundant, which speeds up execution further. Finally, the result should be written back with the Avx2.Store() intrinsic, which again speeds up the code roughly 4x.
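In C# this is the Avx.LoadVector256 / Avx2.Xor / Avx2.Store pattern already shown in the question's update; the hot part of the loop boils down to the sketch below (reusing the ptr, i, PrimeVector and xorResult names from the updated code above):

var vector = Avx.LoadVector256((ulong*)(ptr + i)); // one 32-byte load into a Vector256<ulong>
var res = Avx2.Xor(vector, PrimeVector);           // one VPXOR against the constant vector
Avx2.Store(xorResult, res);                        // one 32-byte store of all four results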
The last optimization that should be applied is the exploitation of CPU instruction-level parallelism. SIMD instructions execute with a fixed throughput and a latency that may be larger than 1 CPU cycle; these parameters are CPU-specific and can be found in the CPU vendor's instruction latency and throughput tables.
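For example, a hypothetical unrolled variant of the loop from the update could process two 32-byte chunks per iteration, so the two VPXOR dependency chains are independent and their latencies can overlap. This is a sketch only; it assumes array.Length is a multiple of 64 and reuses the PrimeVector constant defined above:

private static unsafe void XorUnrolled(byte[] array)
{
    fixed (byte* pData = array)
    {
        byte* ptr = pData;
        for (int i = 0; i + 64 <= array.Length; i += 64)
        {
            var v0 = Avx.LoadVector256((ulong*)(ptr + i));      // chunk 0
            var v1 = Avx.LoadVector256((ulong*)(ptr + i + 32)); // chunk 1
            var r0 = Avx2.Xor(v0, PrimeVector);                 // independent of r1
            var r1 = Avx2.Xor(v1, PrimeVector);                 // independent of r0
            // Use r0 and r1
        }
    }
}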
As all of the optimizations that could be applied are quite complex, I will explain them in a longer write-up a bit later, but in general I would expect up to a 4x speedup from vectorisation compared to the base case for the problem you are working on.
The code sample you are using is a simple loop that modifies data in steps of four unsigned quadwords, and it is a perfect candidate for autovectorization by optimizing compilers. When an identical C++ loop is compiled by GCC 9.1 with the options -O3 -march=haswell, the resulting machine code shows all of the standard optimizations applied to the loop:
#include <cstdint>

void hash(uint64_t* buffer, uint64_t length) {
    uint64_t* pBuffer = buffer;
    const uint64_t CONST1 = 0x6753ul;
    const uint64_t CONST2 = 0x7753ul;
    const uint64_t CONST3 = 0x8753ul;
    const uint64_t CONST4 = 0x9753ul;

    for (uint64_t i = 0; i < length; i += 4)
    {
        *(pBuffer + i) ^= CONST1;
        *(pBuffer + i + 1) ^= CONST2;
        *(pBuffer + i + 2) ^= CONST3;
        *(pBuffer + i + 3) ^= CONST4;
    }
}
Godbolt Compiler Explorer result (GCC 9.1):
test rsi, rsi
je .L11
cmp rsi, -4
ja .L6
lea rdx, [rsi-1]
vmovdqa ymm1, YMMWORD PTR .LC0[rip]
xor eax, eax
shr rdx, 2
inc rdx
.L5:
vpxor ymm0, ymm1, YMMWORD PTR [rdi]
inc rax
add rdi, 32
vmovdqu YMMWORD PTR [rdi-32], ymm0
cmp rax, rdx
jb .L5
vzeroupper
.L11:
ret
.L6:
vmovdqa ymm1, YMMWORD PTR .LC0[rip]
xor eax, eax
.L3:
vpxor ymm0, ymm1, YMMWORD PTR [rdi]
add rax, 4
add rdi, 32
vmovdqu YMMWORD PTR [rdi-32], ymm0
cmp rsi, rax
ja .L3
vzeroupper
jmp .L11
.LC0:
.quad 26451
.quad 30547
.quad 34643
.quad 38739
Godbolt Compiler Explorer result (Clang 8.0):
.LCPI0_0:
.quad 26451 # 0x6753
.quad 30547 # 0x7753
.quad 34643 # 0x8753
.quad 38739 # 0x9753
hash(unsigned long*, unsigned long): # @hash(unsigned long*, unsigned long)
test rsi, rsi
je .LBB0_3
xor eax, eax
vmovaps ymm0, ymmword ptr [rip + .LCPI0_0] # ymm0 = [26451,30547,34643,38739]
.LBB0_2: # =>This Inner Loop Header: Depth=1
vxorps ymm1, ymm0, ymmword ptr [rdi + 8*rax]
vmovups ymmword ptr [rdi + 8*rax], ymm1
add rax, 4
cmp rax, rsi
jb .LBB0_2
.LBB0_3:
vzeroupper
ret