C# Improve performance of SIMD Sum [closed]

I'm writing a SIMD library and trying to squeeze out every bit of performance.
I already cast the array in place to a Span<Vector<int>> instead of creating new objects.
The target arrays are large (more than 1000 elements).
Is there a more efficient way to sum an array?
Ideas are welcome.

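    public static int Sum(int[] array)
    {
        Vector<int> vSum = Vector<int>.Zero;
        int sum;
        int i;

        // Reinterpret the array as whole vectors; leftover elements are handled below.
        Span<Vector<int>> vsArray = MemoryMarshal.Cast<int, Vector<int>>(array);

        for (i = 0; i < vsArray.Length; i++)
        {
            vSum += vsArray[i];
        }

        // Horizontal sum of the accumulator lanes.
        sum = Vector.Dot(vSum, Vector<int>.One);

        // Convert the vector index into an element index.
        i *= Vector<int>.Count;

        // Scalar loop over the remaining elements.
        for (; i < array.Length; i++)
        {
            sum += array[i];
        }

        return sum;
    }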
Gilad Freidkin asked Oct 24 '25 18:10


1 Answer

Your code is good. I could only improve on it by about 4%. Here's how:

    using System.Runtime.CompilerServices;
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    // Test result: only 4% win on my PC.
    // Requires AVX2 and a project compiled with unsafe code enabled.
    [MethodImpl( MethodImplOptions.AggressiveInlining )]
    static int sumUnsafeAvx2( int[] array )
    {
        unsafe
        {
            fixed( int* sourcePointer = array )
            {
                int* pointerEnd = sourcePointer + array.Length;
                // End of the part whose length is a multiple of 16 ints (two 8-lane vectors per iteration).
                int* pointerEndAligned = sourcePointer + ( array.Length - array.Length % 16 );
                // Two independent accumulators.
                Vector256<int> sumLow = Vector256<int>.Zero;
                Vector256<int> sumHigh = sumLow;
                int* pointer;
                for( pointer = sourcePointer; pointer < pointerEndAligned; pointer += 16 )
                {
                    var a = Avx.LoadVector256( pointer );
                    var b = Avx.LoadVector256( pointer + 8 );
                    sumLow = Avx2.Add( sumLow, a );
                    sumHigh = Avx2.Add( sumHigh, b );
                }
                // Horizontal sum: 16 lanes -> 8 -> 4 -> 2 -> 1.
                sumLow = Avx2.Add( sumLow, sumHigh );
                Vector128<int> res4 = Sse2.Add( sumLow.GetLower(), sumLow.GetUpper() );
                res4 = Sse2.Add( res4, Sse2.Shuffle( res4, 0x4E ) );
                res4 = Sse2.Add( res4, Sse2.Shuffle( res4, 1 ) );
                int scalar = res4.ToScalar();
                // Scalar loop over the remaining elements.
                for( ; pointer < pointerEnd; pointer++ )
                    scalar += *pointer;
                return scalar;
            }
        }
    }

Here's a complete test.

To be clear, I don't recommend doing what I wrote above, not for a 4% improvement. Unsafe code is, well, unsafe. Your version works without AVX2 and benefits from AVX512 if available. My optimized version will crash without AVX2 and won't use AVX512 even if the hardware supports it.
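If you still want an intrinsics path, one way around the crash is to check Avx2.IsSupported and fall back to the Vector<int> version otherwise. Below is a minimal sketch of that idea, not a benchmarked implementation. The names GuardedSum, SumPortable and SumAvx2 are hypothetical and not from the post. The AVX2 path here swaps the pointer loop for MemoryMarshal.Cast to Span<Vector256<int>>, so it needs no unsafe block. I'm assuming .NET Core 3.0 or later, and I make no claim that it keeps the 4% win.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class GuardedSum
{
    // Hypothetical dispatcher: take the AVX2 path only when the CPU supports it.
    public static int Sum(int[] array) =>
        Avx2.IsSupported ? SumAvx2(array) : SumPortable(array);

    // Portable path, same approach as the question's code.
    public static int SumPortable(int[] array)
    {
        Span<Vector<int>> vectors = MemoryMarshal.Cast<int, Vector<int>>(array.AsSpan());
        Vector<int> acc = Vector<int>.Zero;
        foreach (Vector<int> v in vectors)
            acc += v;

        // Horizontal sum of the accumulator lanes.
        int sum = Vector.Dot(acc, Vector<int>.One);

        // Scalar loop over elements that did not fill a whole vector.
        for (int i = vectors.Length * Vector<int>.Count; i < array.Length; i++)
            sum += array[i];
        return sum;
    }

    // AVX2 path with two accumulators; call only when Avx2.IsSupported is true.
    public static int SumAvx2(int[] array)
    {
        Span<Vector256<int>> vectors = MemoryMarshal.Cast<int, Vector256<int>>(array.AsSpan());
        Vector256<int> low = Vector256<int>.Zero;
        Vector256<int> high = Vector256<int>.Zero;

        // Consume two vectors per iteration; an odd trailing vector is left for the scalar loop.
        int k = 0;
        for (; k + 1 < vectors.Length; k += 2)
        {
            low = Avx2.Add(low, vectors[k]);
            high = Avx2.Add(high, vectors[k + 1]);
        }

        // Horizontal sum: 16 lanes -> 8 -> 4 -> 2 -> 1.
        low = Avx2.Add(low, high);
        Vector128<int> r = Sse2.Add(low.GetLower(), low.GetUpper());
        r = Sse2.Add(r, Sse2.Shuffle(r, 0x4E));
        r = Sse2.Add(r, Sse2.Shuffle(r, 0x01));
        int sum = r.ToScalar();

        // Scalar loop over everything the vector loop did not consume.
        for (int i = k * Vector256<int>.Count; i < array.Length; i++)
            sum += array[i];
        return sum;
    }
}
```

Callers use GuardedSum.Sum(array) and never touch SumAvx2 directly, so the unsupported-hardware case is handled in one place. Avx2.IsSupported is evaluated by the JIT, so the branch should not cost anything per call, but measure before relying on that.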

Soonts answered Oct 26 '25 07:10


