
What can I do to make this loop run faster?

I have this simple loop:

int[] array = new int[100000000];
int sum = 0;
for (int i = 0; i < array.Length; i++)
    sum += array[i];

I compared its performance with the equivalent C++ version. I thought the performance should be about the same, because the code is very simple and the range checks are eliminated in this case. But it turned out that the C++ version is almost three times faster. So I implemented an unsafe C# version, but the performance was even worse. ReSharper suggests converting the loop to a LINQ expression like this:

sum = array.Sum();

That code is many times slower than the original C# loop.

Could someone tell me if there is anything more I can do to improve the performance of this loop (without compiling it as 64-bit, which is two times faster)?

All tests were made on the 32-bit Release build and run without a debugger.

Edit: Small correction. The 64-bit version is two times faster with doubles, not ints.

userx01233433 asked Oct 13 '13

3 Answers

First a couple of general remarks about micro benchmarks like this:

  • The timings here are so short that JIT time can be significant. A parallel ForEach loop, for example, contains an anonymous delegate that is JITted only when it is first called, so the JIT time is included in the timing the first time the benchmark is run.
  • The context of the code is important as well. The JITter can do a better job optimizing small functions. Isolating the benchmark code in its own function may have a significant impact on performance.
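
As an illustration of both points, here is a minimal sketch of a benchmark that isolates the loop in its own method and calls it once before timing; the class and method names (`Benchmark`, `SumLoop`) are made up for this example:

```csharp
using System;
using System.Diagnostics;

class Benchmark
{
    // Isolating the loop in a small method of its own gives the JIT a
    // better chance to optimize it (second remark above).
    public static int SumLoop(int[] array)
    {
        int sum = 0;
        for (int i = 0; i < array.Length; i++)
            sum += array[i];
        return sum;
    }

    static void Main()
    {
        // Smaller than the 100000000 elements in the question, just to
        // keep the sketch quick.
        int[] array = new int[10000000];

        // Call once before timing so the JIT compilation of SumLoop is
        // not included in the measurement (first remark above).
        SumLoop(array);

        var watch = Stopwatch.StartNew();
        int sum = SumLoop(array);
        Console.WriteLine("separate method: " + watch.ElapsedMilliseconds + " ms, result: " + sum);
    }
}
```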

There are four basic techniques for speeding up the code (if we keep it pure CLR):

  1. Parallelize it. This is obvious.
  2. Unroll loops. This reduces the number of instructions by doing the comparison only once every 2 or more iterations.
  3. Use unsafe code. In this case it is not much of a benefit, because the main cost (the range checks on the array) is optimized away anyway.
  4. Allow the code to be optimized better. We can do this by putting the actual benchmark code in a separate method.

Here is the parallel code:

var syncObj = new object();
Parallel.ForEach(Partitioner.Create(0, array.Length),
    () => 0,
    (src, state, partialSum) => {
        int end = src.Item2;
        for (int i = src.Item1; i < end; i++)
            partialSum += array[i];
        return partialSum;
    },
    partialSum => { lock (syncObj) { sum += partialSum; } });

The Partitioner class lives in the System.Collections.Concurrent namespace.
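
As an aside, the final reduction could also use `Interlocked.Add` instead of a lock. A self-contained sketch of the same approach, wrapped in a hypothetical `ParallelSum.Sum` helper:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class ParallelSum
{
    public static int Sum(int[] array)
    {
        int sum = 0;
        Parallel.ForEach(Partitioner.Create(0, array.Length),
            () => 0,                                   // per-task partial sum
            (range, state, partialSum) =>
            {
                // Each task sums its own [Item1, Item2) range.
                for (int i = range.Item1; i < range.Item2; i++)
                    partialSum += array[i];
                return partialSum;
            },
            // Interlocked.Add replaces the lock for the final reduction.
            partialSum => Interlocked.Add(ref sum, partialSum));
        return sum;
    }

    static void Main()
    {
        int[] array = new int[1000];
        for (int i = 0; i < array.Length; i++) array[i] = 1;
        Console.WriteLine(Sum(array)); // prints 1000
    }
}
```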

On my machine (i7 950, 8 logical cores), the timings I got were:

For loop: 196.786 ms
For loop (separate method): 72.319 ms
Unrolled for loop: 196.167 ms
Unrolled for loop (separate method): 67.961 ms
Parallel.Foreach (1st time): 48.243 ms
Parallel.Foreach (2nd time): 26.356 ms

There was no significant difference between 32 bit and 64 bit code.

Jeffrey Sax answered Nov 07 '22


Unroll the loop 2-8 times. Measure which one is best. The .NET JIT optimizes poorly, so you have to do some of its work.

You'll probably have to add unsafe as well because the JIT will now be unable to optimize out the array bounds checks.

You can also try to aggregate into multiple sum variables:

int sum1 = 0, sum2 = 0;
for (int i = 0; i < array.Length; i+=2) {
    sum1 += array[i+0];
    sum2 += array[i+1];
}

That might increase instruction-level parallelism because all add instructions are now independent.

The i+0 is optimized to i automatically.
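
Note that the unrolled loop above assumes the array length is a multiple of the unroll factor. A sketch that also handles a leftover element (the `UnrolledSum` class name is made up here):

```csharp
using System;

class UnrolledSum
{
    // Two-way unrolled sum that also works for odd-length arrays: the
    // main loop covers the largest even prefix, and any leftover element
    // is added afterwards.
    public static int Sum(int[] array)
    {
        int sum1 = 0, sum2 = 0;
        int even = array.Length - (array.Length % 2);
        for (int i = 0; i < even; i += 2)
        {
            sum1 += array[i];
            sum2 += array[i + 1];
        }
        if (even < array.Length)
            sum1 += array[even]; // leftover element
        return sum1 + sum2;
    }

    static void Main()
    {
        Console.WriteLine(Sum(new int[] { 1, 2, 3, 4, 5 })); // prints 15
    }
}
```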


I tested it and it shaved off about 30%.

The timings are stable when repeated. Code:

Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;

var watch = new Stopwatch();

int[] array = new int[500000000];
for (int i = 0; i < array.Length; i++)
{
    array[i] = 1;
}

//warmup
{
    watch.Restart();
    int sum = 0;
    for (int i = 0; i < array.Length; i++)
        sum += array[i];
}

for (int i2 = 0; i2 < 5; i2++)
{
    {
        watch.Restart();
        int sum = 0;
        for (int i = 0; i < array.Length; i++)
            sum += array[i];
        Console.WriteLine("for loop:" + watch.ElapsedMilliseconds + "ms, result:" + sum);
    }

    {
        // the fixed blocks below require an unsafe context (compile with /unsafe)
        watch.Restart();
        fixed (int* ptr = array)
        {
            int sum = 0;
            var length = array.Length;
            for (int i = 0; i < length; i++)
                sum += ptr[i];
            Console.WriteLine("unsafe loop:" + watch.ElapsedMilliseconds + "ms, result:" + sum);
        }
    }

    {
        watch.Restart();
        fixed (int* ptr = array)
        {
            int sum1 = 0;
            int sum2 = 0;
            int sum3 = 0;
            int sum4 = 0;
            var length = array.Length;
            for (int i = 0; i < length; i += 4)
            {
                sum1 += ptr[i + 0];
                sum2 += ptr[i + 1];
                sum3 += ptr[i + 2];
                sum4 += ptr[i + 3];
            }
            Console.WriteLine("unrolled unsafe loop:" + watch.ElapsedMilliseconds + "ms, result:" + (sum1 + sum2 + sum3 + sum4));
        }
    }

    Console.WriteLine("===");
}

Further playing around, it turns out that multiple aggregation variables do nothing. Unrolling the loop made a major improvement, though. Unsafe did nothing (except in the unrolled case, where it is pretty much required). Unrolling 2 times is as good as 4 times.

Running this on a Core i7.

usr answered Nov 07 '22


var watch = new Stopwatch();

int[] array = new int[100000000];
for (int i = 0; i < array.Length; i++)
{
    array[i] = 1;
}

watch.Restart();
int sum = 0;
for (int i = 0; i < array.Length; i++)
    sum += array[i];
Console.WriteLine("for loop:" + watch.ElapsedMilliseconds + "ms, result:" + sum);

sum = 0;
watch.Restart();
sum = array.Sum();
Console.WriteLine("linq sum:" + watch.ElapsedMilliseconds + "ms, result:" + sum);

sum = 0;
watch.Restart();
int length = array.Length;
for (int i = 0; i < length; i++)
    sum += array[i];
Console.WriteLine("for loop fixed:" + watch.ElapsedMilliseconds + "ms, result:" + sum);

sum = 0;
watch.Restart();
foreach (int i in array)
{
    sum += i;
}
Console.WriteLine("foreach sum:" + watch.ElapsedMilliseconds + "ms, result:" + sum);

sum = 0;
watch.Restart();
sum = array.AsParallel().Sum();
Console.WriteLine("linq parallel sum:" + watch.ElapsedMilliseconds + "ms, result:" + sum);

The parallel LINQ Sum seems to be the fastest, at least on my machine.

Hoisting the length into a local variable doesn't matter much, but improves things by roughly 10%.

There is not much more you can actually do; unmanaged C code will always be faster for this.

Results on my PC are:

for loop:      241ms, result:100000000
linq sum:      559ms, result:100000000
for loop fixed:237ms, result:100000000
foreach sum:   295ms, result:100000000
linq parallel: 205ms, result:100000000

MichaC answered Nov 07 '22