I am rewriting a high-performance C++ application in C#. The C# app is noticeably slower than the C++ original, and profiling shows that it spends most of its time accessing array elements. Hence I created a simple array access benchmark, and I get completely different results than others doing a similar comparison.
The C++ code:
#include <limits>
#include <stdio.h>
#include <chrono>
#include <iostream>
using namespace std;
using namespace std::chrono;
int main(void)
{
    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    int xRepLen = 100 * 1000;
    int xRepCount = 1000;

    unsigned short * xArray = new unsigned short[xRepLen];
    for (int xIdx = 0; xIdx < xRepLen; xIdx++)
        xArray[xIdx] = xIdx % USHRT_MAX;

    int * xResults = new int[xRepLen];
    for (int xRepIdx = 0; xRepIdx < xRepCount; xRepIdx++)
    {
        // in each repetition, find the first value that surpasses xArray[xIdx] + 25 - i.e. we will perform 25 searches
        for (int xIdx = 0; xIdx < xRepLen; xIdx++)
        {
            unsigned short xValToBreach = (xArray[xIdx] + 25) % USHRT_MAX;
            xResults[xIdx] = 0;

            for (int xIdx2 = xIdx + 1; xIdx2 < xRepLen; xIdx2++)
                if (xArray[xIdx2] >= xValToBreach)
                {
                    xResults[xIdx] = xIdx2;
                    break;
                }

            if (xResults[xIdx] == 0)
                xResults[xIdx] = INT_MAX;
        }
    }

    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(t2 - t1).count();
    cout << "Elapsed milliseconds " << duration;
    getchar();
}
The C# code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
namespace arrayBenchmarkCs
{
    class Program
    {
        public static void benchCs()
        {
            unsafe
            {
                int xRepLen = 100 * 1000;
                int xRepCount = 1000;

                ushort[] xArr = new ushort[xRepLen];
                for (int xIdx = 0; xIdx < xRepLen; xIdx++)
                    xArr[xIdx] = (ushort)(xIdx % 0xffff);

                int[] xResults = new int[xRepLen];

                Stopwatch xSw = new Stopwatch();
                xSw.Start();
                fixed (ushort* xArrayStart = &xArr[0])
                {
                    for (int xRepIdx = 0; xRepIdx < xRepCount; xRepIdx++)
                    {
                        // in each repetition, find the first value that surpasses xArray[xIdx] + 25 - i.e. we will perform 25 searches
                        ushort* xArrayEnd = xArrayStart + xRepLen;
                        for (ushort* xPtr = xArrayStart; xPtr != xArrayEnd; xPtr++)
                        {
                            ushort xValToBreach = (ushort)((*xPtr + 25) % 0xffff);
                            int xResult = -1;

                            for (ushort* xPtr2 = xPtr + 1; xPtr2 != xArrayEnd; xPtr2++)
                                if (*xPtr2 >= xValToBreach)
                                {
                                    xResult = (int)(xPtr2 - xArrayStart);
                                    break;
                                }

                            if (xResult == -1)
                                xResult = int.MaxValue;

                            // save result
                            xResults[xPtr - xArrayStart] = xResult;
                        }
                    }
                } // fixed
                xSw.Stop();
                Console.WriteLine("Elapsed milliseconds: " + xSw.ElapsedMilliseconds.ToString("0"));
            }
        }

        static void Main(string[] args)
        {
            benchCs();
            Console.ReadKey();
        }
    }
}
On my work computer (i7-3770), the C++ version is approximately 2x faster than the C# version; on my home computer (i7-5820K) it is about 1.5x faster. Both are measured in Release builds. I hoped that by using pointers in C# I would avoid the array bounds checking and get the same performance in both languages.
So my questions are the following: why is the C# version still slower, and is there a way to make it match the C++ performance?
Any hint is much appreciated, Daniel
Optimizing this kind of code requires using SIMD. That means processing multiple data items at a time (vectorization).
For C, C++ or Rust, see for example https://matklad.github.io/2023/04/09/can-you-trust-a-compiler-to-optimize-your-code.html . These techniques rely on autovectorization.
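The blocked-scan technique from that post can be sketched in C++ (a sketch only; the function name find_first_ge and the block size are my own choices, not from the question's code). The inner "any match in this block?" loop has no early exit, which is what lets the compiler vectorize it; only the matching block is then re-scanned to find the exact index:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Find the first index i with a[i] >= threshold, or len if there is none.
// The per-block loop below is branch-free, so it is a candidate for
// autovectorization; the scalar tail pins down the exact position.
std::size_t find_first_ge(const std::uint16_t* a, std::size_t len,
                          std::uint16_t threshold)
{
    constexpr std::size_t B = 32; // block size: a tuning assumption
    std::size_t i = 0;
    for (; i + B <= len; i += B) {
        int any = 0;
        for (std::size_t j = 0; j < B; ++j) // no early exit: vectorizable
            any |= (a[i + j] >= threshold);
        if (any)
            break; // the scalar loop below locates the exact index
    }
    for (; i < len; ++i) // matching block and/or remainder
        if (a[i] >= threshold)
            return i;
    return len;
}
```

Whether the compiler actually vectorizes the inner loop depends on the compiler and flags, so it is worth checking the generated assembly (e.g. on godbolt).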
C# does not appear to support autovectorization, but this particular inner loop can be implemented with MemoryExtensions.IndexOfAnyInRange (documented at https://learn.microsoft.com/en-us/dotnet/api/system.memoryextensions.indexofanyinrange?view=net-8.0 ), which has a vectorized implementation for various primitive types. This introduces a redundant comparison against ushort.MaxValue, but in my quick benchmark it is still almost twice as fast as the simple loop with pointers, and it needs fewer lines of code as well.
Sample loop:
public int[] UsingSpan() {
    ushort[] a = xArr;
    int[] r = xResults;
    for (int i = 0; i < a.Length; i++) {
        ushort xValToBreach = (ushort)((a[i] + 25) % 0xffff);
        int xResult = a.AsSpan(i + 1).IndexOfAnyInRange(xValToBreach, ushort.MaxValue);
        r[i] = xResult == -1 ? int.MaxValue : xResult + i + 1;
    }
    return r;
}
A guard such as if (a.Length != r.Length) throw new Exception(); can eliminate the bounds check on r, and trickery with MemoryMarshal.CreateReadOnlySpan and Unsafe.Add can eliminate the bounds check from AsSpan, but these provide little benefit.
The % 0xffff operation is not optimized very well in .NET 9: it does not use a division instruction but still needs many instructions.
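Since the array elements stay below 0xffff, the operand of % 0xffff is always less than 2 * 0xffff, so the modulo can be replaced by a single compare-and-subtract. A small C++ sketch of the identity (the helper name wrap is mine, not from the code above):

```cpp
#include <cassert>
#include <cstdint>

// For v < 2 * 0xffff, v % 0xffff is just "subtract 0xffff once if needed".
inline std::uint16_t wrap(std::uint32_t v)
{
    return static_cast<std::uint16_t>(v >= 0xffff ? v - 0xffff : v);
}
```

The same rewrite applies verbatim in C#.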
Alternatively, use generic explicit SIMD support such as https://learn.microsoft.com/en-us/dotnet/api/system.numerics.vector.greaterthan?view=net-8.0#system-numerics-vector-greaterthan-1(system-numerics-vector((-0))-system-numerics-vector((-0))) in C#. I have not tried this.
Alternatively, use intrinsics specific to the CPU and SIMD instruction set. This can be done in C, C++, Rust and C#.
By the way, x86-64 clang 19.1.0 with -O2 on godbolt optimizes out the entire C++ benchmark, since it can see that the values stored in xResults are never read (it also needs #include <climits> added to compile).
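A common way to keep such a benchmark alive is to give the stored results an observable use, for example folding them into a checksum that is printed after the timing stops (a sketch; the helper name checksum is mine):

```cpp
#include <cassert>

// Summing the results and printing the total after timing gives the stores
// an observable effect, so the compiler cannot prove them dead and delete
// the benchmark loop.
long long checksum(const int* results, int n)
{
    long long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += results[i];
    return sum;
}

// ... after the timing code in main:
//     std::cout << "checksum " << checksum(xResults, xRepLen) << "\n";
```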