I am rewriting a high-performance C++ application in C#. The C# app is noticeably slower than the C++ original, and profiling shows that it spends most of its time accessing array elements. Hence I created a simple array access benchmark, and I get completely different results than others doing a similar comparison.
The C++ code:
#include <limits>
#include <stdio.h>
#include <chrono>
#include <iostream>
using namespace std;
using namespace std::chrono;
int main(void)
{
    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    int xRepLen = 100 * 1000;
    int xRepCount = 1000;

    unsigned short * xArray = new unsigned short[xRepLen];
    for (int xIdx = 0; xIdx < xRepLen; xIdx++)
        xArray[xIdx] = xIdx % USHRT_MAX;

    int * xResults = new int[xRepLen];
    for (int xRepIdx = 0; xRepIdx < xRepCount; xRepIdx++)
    {
        // in each repetition, find the first value that surpasses xArray[xIdx] + 25 - i.e. we will perform 25 searches
        for (int xIdx = 0; xIdx < xRepLen; xIdx++)
        {
            unsigned short xValToBreach = (xArray[xIdx] + 25) % USHRT_MAX;
            xResults[xIdx] = 0;

            for (int xIdx2 = xIdx + 1; xIdx2 < xRepLen; xIdx2++)
                if (xArray[xIdx2] >= xValToBreach)
                {
                    xResults[xIdx] = xIdx2;
                    break;
                }

            if (xResults[xIdx] == 0)
                xResults[xIdx] = INT_MAX;
        }
    }

    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(t2 - t1).count();
    cout << "Elapsed milliseconds " << duration;
    getchar();
}
The C# code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
namespace arrayBenchmarkCs
{
    class Program
    {
        public static void benchCs()
        {
            unsafe
            {
                int xRepLen = 100 * 1000;
                int xRepCount = 1000;

                ushort[] xArr = new ushort[xRepLen];
                for (int xIdx = 0; xIdx < xRepLen; xIdx++)
                    xArr[xIdx] = (ushort)(xIdx % 0xffff);

                int[] xResults = new int[xRepLen];

                Stopwatch xSw = new Stopwatch();
                xSw.Start();
                fixed (ushort* xArrayStart = &xArr[0])
                {
                    for (int xRepIdx = 0; xRepIdx < xRepCount; xRepIdx++)
                    {
                        // in each repetition, find the first value that surpasses xArray[xIdx] + 25 - i.e. we will perform 25 searches
                        ushort* xArrayEnd = xArrayStart + xRepLen;
                        for (ushort* xPtr = xArrayStart; xPtr != xArrayEnd; xPtr++)
                        {
                            ushort xValToBreach = (ushort)((*xPtr + 25) % 0xffff);
                            int xResult = -1;

                            for (ushort* xPtr2 = xPtr + 1; xPtr2 != xArrayEnd; xPtr2++)
                                if (*xPtr2 >= xValToBreach)
                                {
                                    xResult = (int)(xPtr2 - xArrayStart);
                                    break;
                                }

                            if (xResult == -1)
                                xResult = int.MaxValue;

                            // save result
                            xResults[xPtr - xArrayStart] = xResult;
                        }
                    }
                } // fixed
                xSw.Stop();
                Console.WriteLine("Elapsed milliseconds: " + xSw.ElapsedMilliseconds.ToString("0"));
            }
        }

        static void Main(string[] args)
        {
            benchCs();
            Console.ReadKey();
        }
    }
}
On my work computer (i7-3770), the C++ version is approximately 2x faster than the C# version; on my home computer (i7-5820K) it is about 1.5x faster. Both are measured in Release builds. I hoped that by using pointers in C# I would avoid the array bounds checking and get the same performance in both languages.
So my questions are the following: why is the C# version still slower, and is there a way to make it match the C++ performance?
Any hint is much appreciated, Daniel
Optimizing this kind of code requires using SIMD. That means processing multiple data items at a time (vectorization).
For C, C++ or Rust, see for example https://matklad.github.io/2023/04/09/can-you-trust-a-compiler-to-optimize-your-code.html . These techniques rely on autovectorization.
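The blocked-scan technique from that post can be sketched in C++ (a sketch only; the function name find_first_ge and the block size are my own choices, not from the question's code). The inner "any match in this block?" loop has no early exit, which is what lets the compiler vectorize it; only the matching block is then re-scanned to find the exact index:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Find the first index i with a[i] >= threshold, or len if there is none.
// The per-block loop below is branch-free, so it is a candidate for
// autovectorization; the scalar tail pins down the exact position.
std::size_t find_first_ge(const std::uint16_t* a, std::size_t len,
                          std::uint16_t threshold)
{
    constexpr std::size_t B = 32; // block size: a tuning assumption
    std::size_t i = 0;
    for (; i + B <= len; i += B) {
        int any = 0;
        for (std::size_t j = 0; j < B; ++j) // no early exit: vectorizable
            any |= (a[i + j] >= threshold);
        if (any)
            break; // the scalar loop below locates the exact index
    }
    for (; i < len; ++i) // matching block and/or remainder
        if (a[i] >= threshold)
            return i;
    return len;
}
```

Whether the compiler actually vectorizes the inner loop depends on the compiler and flags, so it is worth checking the generated assembly (e.g. on godbolt).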
C# does not appear to support autovectorization, but this particular inner loop can be implemented with MemoryExtensions.IndexOfAnyInRange (documented at https://learn.microsoft.com/en-us/dotnet/api/system.memoryextensions.indexofanyinrange?view=net-8.0 ), which has a vectorized implementation for various primitive types. This introduces a redundant comparison against ushort.MaxValue, but in my quick benchmark it is still almost twice as fast as the simple loop with pointers, and it needs fewer lines of code as well.
Sample loop:
public int[] UsingSpan() {
    ushort[] a = xArr;
    int[] r = xResults;
    for (int i = 0; i < a.Length; i++) {
        ushort xValToBreach = (ushort)((a[i] + 25) % 0xffff);
        int xResult = a.AsSpan(i + 1).IndexOfAnyInRange(xValToBreach, ushort.MaxValue);
        r[i] = xResult == -1 ? int.MaxValue : xResult + i + 1;
    }
    return r;
}
A guard such as if (a.Length != r.Length) throw new Exception(); can eliminate the bounds check on r, and trickery with MemoryMarshal.CreateReadOnlySpan and Unsafe.Add can eliminate the bounds check from AsSpan, but these provide little benefit.
The % 0xffff operation is not optimized very well in .NET 9: it does not use a division instruction but still needs many instructions.
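Since the array elements stay below 0xffff, the operand of % 0xffff is always less than 2 * 0xffff, so the modulo can be replaced by a single compare-and-subtract. A small C++ sketch of the identity (the helper name wrap is mine, not from the code above):

```cpp
#include <cassert>
#include <cstdint>

// For v < 2 * 0xffff, v % 0xffff is just "subtract 0xffff once if needed".
inline std::uint16_t wrap(std::uint32_t v)
{
    return static_cast<std::uint16_t>(v >= 0xffff ? v - 0xffff : v);
}
```

The same rewrite applies verbatim in C#.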
Alternatively, use generic explicit SIMD support such as https://learn.microsoft.com/en-us/dotnet/api/system.numerics.vector.greaterthan?view=net-8.0#system-numerics-vector-greaterthan-1(system-numerics-vector((-0))-system-numerics-vector((-0))) in C#. I have not tried this.
Alternatively, use intrinsics specific to the CPU and SIMD instruction set. This can be done in C, C++, Rust and C#.
By the way, x86-64 clang 19.1.0 with -O2 on godbolt optimizes out the entire C++ benchmark, since it can see that the values stored in xResults are never read (it also needs #include <climits> added to compile).
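A common way to keep such a benchmark alive is to give the stored results an observable use, for example folding them into a checksum that is printed after the timing stops (a sketch; the helper name checksum is mine):

```cpp
#include <cassert>

// Summing the results and printing the total after timing gives the stores
// an observable effect, so the compiler cannot prove them dead and delete
// the benchmark loop.
long long checksum(const int* results, int n)
{
    long long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += results[i];
    return sum;
}

// ... after the timing code in main:
//     std::cout << "checksum " << checksum(xResults, xRepLen) << "\n";
```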