
Using SIMD operations from C# in .NET Framework 4.6 is slower

Tags:

c#

.net

ryujit

I am currently trying to calculate the sum of all the values in a huge array, once with plain C# and once with SIMD, to compare performance. The SIMD version is considerably slower. Please see the code snippets below and let me know if I am missing something. "vals" is the huge array, which is read from an image file; I omitted that part to keep the example lean.

var watch1 = new Stopwatch();
watch1.Start();
var total = vals.Aggregate(0, (a, i) => a + i);
watch1.Stop();
Console.WriteLine(string.Format("Total is: {0}", total));
Console.WriteLine(string.Format("Time taken: {0}", watch1.ElapsedMilliseconds));

var watch2 = new Stopwatch();
watch2.Start();
var sTotal = GetSIMDVectors(vals).Aggregate((a, i) => a + i);
int sum = 0;
for (int i = 0; i < Vector<int>.Count; i++)
    sum += sTotal[i];
watch2.Stop();
Console.WriteLine(string.Format("Another Total is: {0}", sum));
Console.WriteLine(string.Format("Time taken: {0}", watch2.ElapsedMilliseconds));

And the GetSIMDVectors method:

private static IEnumerable<Vector<int>> GetSIMDVectors(short[] source)
{
    int vecCount = Vector<int>.Count;
    int i = 0;
    int len = source.Length;
    for(i = 0; i + vecCount < len; i = i + vecCount)
    {
        var items = new int[vecCount];
        for (int k = 0; k < vecCount; k++)
        {
            items[k] = source[i + k];
        }
        yield return new Vector<int>(items);
    }
    var remaining = new int[vecCount];
    for (int j = i, k =0; j < len; j++, k++)
    {
        remaining[k] = source[j];
    }
    yield return new Vector<int>(remaining);
}

Vish asked Jan 07 '23 18:01



1 Answer

As @mike z has indicated, you need to make sure you are in release mode and targeting 64-bit, or else RyuJIT, the compiler that supports SIMD, won't be used (for now it is only supported on 64-bit architectures). It is also good practice to check before execution using:

Vector.IsHardwareAccelerated;
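
As a minimal sketch of that check (the class name SimdCheck is only illustrative, and the project is assumed to reference System.Numerics.Vectors), a small console program can print the relevant facts before any timing is done:

```csharp
using System;
using System.Numerics;

static class SimdCheck
{
    static void Main()
    {
        // RyuJIT SIMD support on .NET Framework 4.6 needs a 64-bit process.
        Console.WriteLine("64-bit process: " + Environment.Is64BitProcess);

        // False means Vector<T> operations fall back to plain software code.
        Console.WriteLine("Hardware accelerated: " + Vector.IsHardwareAccelerated);

        // Number of int lanes per vector on this machine.
        Console.WriteLine("Vector<int>.Count: " + Vector<int>.Count);
    }
}
```

If the second line prints False, the SIMD version is not expected to beat a plain loop, so it is worth fixing the build configuration first.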

Also, you do not need a for loop to fill an array before creating the vector. You should simply create the vector from the original source array using the Vector<int>(int[] array, int index) constructor (this requires the source to be an int[] rather than the short[] in your method).

yield return new Vector<int>(source, i);

instead of

var items = new int[vecCount];
for (int k = 0; k < vecCount; k++)
{
    items[k] = source[i + k];
}
yield return new Vector<int>(items);

This way, I managed to get a nearly 3.7x increase in performance using a randomly generated large array.

Moreover, you could replace your method with one that accumulates the sum directly as soon as it gets the value of new Vector<int>(source, i), like this:

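private static int GetSIMDVectorsSum(int[] source)
{
    int vecCount = Vector<int>.Count;
    int i = 0;
    // Last index from which a full vector can still be loaded.
    int lastFullStart = source.Length - vecCount;

    Vector<int> temp = Vector<int>.Zero;

    // Accumulate full vectors only; the constructor throws if fewer
    // than vecCount elements remain after index i.
    for (; i <= lastFullStart; i += vecCount)
    {
        temp += new Vector<int>(source, i);
    }

    // Horizontal add: the dot product with all ones sums the lanes.
    int sum = Vector.Dot<int>(temp, Vector<int>.One);

    // Add the leftover elements that did not fill a whole vector.
    for (; i < source.Length; i++)
    {
        sum += source[i];
    }

    return sum;
}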

The performance increases even more dramatically here. In my tests, I got a 16x increase in performance over vals.Aggregate(0, (a, i) => a + i).

However, from a theoretical point of view, if Vector<int>.Count returns 4, for example, then anything above a 4x increase in performance indicates that you are comparing the vectorized version to relatively unoptimized code.

In your case, that is the vals.Aggregate(0, (a, i) => a + i) portion, which goes through a delegate call for every element. So there is plenty of room for you to optimize here.

When I replace it with a trivial for loop:

private static int no_vec_sum(int[] vals)
{
    int end = vals.Length;
    int temp = 0;

    for (int i = 0; i < end; i++)
    {
        temp += vals[i];
    }
    return temp;
}

I only get a 1.5x increase in performance. That is still an improvement for this very particular case, considering the simplicity of the operation.
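
Ratios like these depend heavily on how the timing is done. The following is only a sketch of a comparison harness, not the code used for the numbers above: the class name SumBenchmark, the array size, the run count, and the random seed are all arbitrary assumptions. It calls each method once before timing so that JIT compilation is not included in the measurement.

```csharp
using System;
using System.Diagnostics;
using System.Numerics;

static class SumBenchmark
{
    // Plain scalar loop, used as the baseline.
    public static int ScalarSum(int[] vals)
    {
        int total = 0;
        for (int i = 0; i < vals.Length; i++) total += vals[i];
        return total;
    }

    // Vectorized sum with a scalar loop for the leftover elements.
    public static int VectorSum(int[] vals)
    {
        int step = Vector<int>.Count;
        Vector<int> acc = Vector<int>.Zero;
        int i = 0;
        for (; i <= vals.Length - step; i += step)
            acc += new Vector<int>(vals, i);
        int total = Vector.Dot(acc, Vector<int>.One);
        for (; i < vals.Length; i++) total += vals[i];
        return total;
    }

    // Times 'runs' calls after one untimed warm-up call.
    public static long TimeMs(Func<int[], int> sum, int[] vals, int runs)
    {
        sum(vals); // warm-up so JIT compilation is not measured
        var watch = Stopwatch.StartNew();
        for (int r = 0; r < runs; r++) sum(vals);
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    static void Main()
    {
        // Arbitrary test data: small values so the int sum cannot overflow.
        var rng = new Random(42);
        var vals = new int[10000000];
        for (int i = 0; i < vals.Length; i++) vals[i] = rng.Next(0, 100);

        Console.WriteLine("Scalar: {0} ms", TimeMs(ScalarSum, vals, 20));
        Console.WriteLine("Vector: {0} ms", TimeMs(VectorSum, vals, 20));
    }
}
```

Run it as a release build targeting x64; a debug build or a 32-bit process would most likely hide any SIMD benefit.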

Needless to say, large arrays are required for the vectorized version to overcome the overhead induced by creating a new Vector<int>() in each iteration.
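
One caveat: the array in the question is a short[], while the Vector<int> constructor used above takes an int[]. A possible way to avoid copying the data is sketched below. It is not part of the measurements above; the class name ShortSum is hypothetical, and it assumes a version of System.Numerics.Vectors that provides Vector.Widen, which may not be available in older packages.

```csharp
using System;
using System.Numerics;

static class ShortSum
{
    // Hypothetical helper: sums a short[] by widening each Vector<short>
    // into two Vector<int> halves and accumulating both in int lanes.
    public static int Sum(short[] source)
    {
        int step = Vector<short>.Count;
        Vector<int> acc = Vector<int>.Zero;
        int i = 0;

        // Process only whole vectors; the tail is handled below.
        for (; i <= source.Length - step; i += step)
        {
            Vector<int> low, high;
            Vector.Widen(new Vector<short>(source, i), out low, out high);
            acc += low + high;
        }

        // Horizontal add of the int lanes.
        int sum = Vector.Dot(acc, Vector<int>.One);

        // Leftover elements that did not fill a whole Vector<short>.
        for (; i < source.Length; i++) sum += source[i];
        return sum;
    }

    static void Main()
    {
        // Arbitrary sample data standing in for the image values.
        var data = new short[1000];
        for (int k = 0; k < data.Length; k++) data[k] = (short)(k % 100);
        Console.WriteLine(Sum(data));
    }
}
```

Accumulating in Vector<int> rather than Vector<short> keeps the same overflow behavior as the int total in the question.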

Ibrahem Atef answered Jan 28 '23 09:01
