I have this simple loop:
int[] array = new int[100000000];
int sum = 0;
for (int i = 0; i < array.Length; i++)
sum += array[i];
I compared its performance with its C++ version. I thought that the performance should be nearly the same, because it is very simple code and the range checks are omitted in that case. But it turned out that the C++ version is almost three times faster. So I implemented an unsafe C# version, but the performance was even worse. ReSharper suggests converting the loop to a LINQ expression like this:
sum = array.Sum();
That code is many times slower than the original C# loop.
Could someone tell me if there is anything more I can do to improve the performance of this loop (without compiling it as a 64-bit version, which is two times faster)?
All tests were made on a 32-bit Release build and run without the debugger.
Edit: Small correction. The 64-bit version is two times faster with doubles, not ints.
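For reference, a typical unsafe version of such a loop looks like this (a minimal sketch, assuming the usual fixed/pointer approach; it must be compiled with /unsafe):
int sum = 0;
unsafe
{
    fixed (int* ptr = array)
    {
        // Raw pointer access removes the bounds checks, but the
        // loop is still the same scalar code, so it is not
        // automatically faster.
        for (int i = 0; i < array.Length; i++)
            sum += ptr[i];
    }
}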
First a couple of general remarks about micro benchmarks like this:
The Parallel.ForEach loop contains an anonymous delegate that is only JITted when it is first called, so the JIT time is included in the timing the first time the benchmark is run.
There are four basic techniques for speeding up the code (if we keep it pure CLR), as the timings below show: moving the loop into a separate method, unrolling it, combining the two, and parallelizing it.
Here is the parallel code:
int sum = 0;
var syncObj = new object();
Parallel.ForEach(Partitioner.Create(0, array.Length),
    () => 0,
    (src, state, partialSum) =>
    {
        int end = src.Item2;
        for (int i = src.Item1; i < end; i++)
            partialSum += array[i];
        return partialSum;
    },
    // Combine the per-partition sums under a lock.
    partialSum => { lock (syncObj) { sum += partialSum; } });
The Partitioner class lives in the System.Collections.Concurrent namespace.
On my machine (i7 950, 8 logical cores), the timings I got were:
For loop: 196.786 ms
For loop (separate method): 72.319 ms
Unrolled for loop: 196.167 ms
Unrolled for loop (separate method): 67.961 ms
Parallel.ForEach (1st time): 48.243 ms
Parallel.ForEach (2nd time): 26.356 ms
There was no significant difference between 32 bit and 64 bit code.
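For reference, the "separate method" variants above simply hoist the summing loop into its own method; a minimal sketch (the method name is my own):
static int SumArray(int[] array)
{
    // A small dedicated method gives the JIT a simpler compilation
    // unit; the timings above suggest it optimizes this noticeably
    // better than the same loop inlined in a large benchmark method.
    int sum = 0;
    for (int i = 0; i < array.Length; i++)
        sum += array[i];
    return sum;
}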
Unroll the loop 2-8 times and measure which factor is best. The .NET JIT optimizes poorly, so you have to do some of its work. You'll probably have to add unsafe as well, because the JIT will now be unable to optimize out the array bounds checks.
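A sketch of what that combination might look like, assuming a 4x unroll (other factors are worth measuring too) and a scalar tail for lengths not divisible by four:
static unsafe int SumUnrolled(int[] array)
{
    int sum = 0;
    fixed (int* ptr = array)
    {
        int i = 0;
        int last = array.Length - array.Length % 4;
        // Unrolled main loop; pointer access avoids the bounds checks
        // that the JIT cannot eliminate for the strided indices.
        for (; i < last; i += 4)
            sum += ptr[i] + ptr[i + 1] + ptr[i + 2] + ptr[i + 3];
        // Scalar tail for the remaining elements.
        for (; i < array.Length; i++)
            sum += ptr[i];
    }
    return sum;
}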
You can also try to aggregate into multiple sum variables:
int sum1 = 0, sum2 = 0;
// Assumes an even array length; add a scalar tail otherwise.
for (int i = 0; i < array.Length; i += 2)
{
    sum1 += array[i + 0];
    sum2 += array[i + 1];
}
That might increase instruction-level parallelism because all the add instructions are now independent. The i+0 is optimized to i automatically.
I tested it and it shaved off about 30%.
The timings are stable when repeated. Code:
// Note: the fixed blocks below require an unsafe context (compile with /unsafe).
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
var watch = new Stopwatch();
int[] array = new int[500000000];
for (int i = 0; i < array.Length; i++)
{
    array[i] = 1;
}

// Warmup run (not timed).
{
    watch.Restart();
    int sum = 0;
    for (int i = 0; i < array.Length; i++)
        sum += array[i];
}

for (int i2 = 0; i2 < 5; i2++)
{
    // Plain for loop.
    {
        watch.Restart();
        int sum = 0;
        for (int i = 0; i < array.Length; i++)
            sum += array[i];
        Console.WriteLine("for loop: " + watch.ElapsedMilliseconds + "ms, result: " + sum);
    }
    // Unsafe pointer loop.
    {
        watch.Restart();
        fixed (int* ptr = array)
        {
            int sum = 0;
            var length = array.Length;
            for (int i = 0; i < length; i++)
                sum += ptr[i];
            Console.WriteLine("unsafe for loop: " + watch.ElapsedMilliseconds + "ms, result: " + sum);
        }
    }
    // Unsafe pointer loop, unrolled 4x with four accumulators.
    {
        watch.Restart();
        fixed (int* ptr = array)
        {
            int sum1 = 0;
            int sum2 = 0;
            int sum3 = 0;
            int sum4 = 0;
            var length = array.Length;
            for (int i = 0; i < length; i += 4)
            {
                sum1 += ptr[i + 0];
                sum2 += ptr[i + 1];
                sum3 += ptr[i + 2];
                sum4 += ptr[i + 3];
            }
            Console.WriteLine("unrolled unsafe for loop: " + watch.ElapsedMilliseconds + "ms, result: " + (sum1 + sum2 + sum3 + sum4));
        }
    }
    Console.WriteLine("===");
}
After further playing around, it turns out that multiple aggregation variables do nothing. Unrolling the loop brought a major improvement, though. Unsafe did nothing (except in the unrolled case, where it is pretty much required). Unrolling 2 times is as good as 4 times.
Running this on a Core i7.
var watch = new Stopwatch();
int[] array = new int[100000000];
for (int i = 0; i < array.Length; i++)
{
    array[i] = 1;
}

// Plain for loop.
watch.Restart();
int sum = 0;
for (int i = 0; i < array.Length; i++)
    sum += array[i];
Console.WriteLine("for loop: " + watch.ElapsedMilliseconds + "ms, result: " + sum);

// LINQ Sum().
sum = 0;
watch.Restart();
sum = array.Sum();
Console.WriteLine("linq sum: " + watch.ElapsedMilliseconds + "ms, result: " + sum);

// For loop with the length hoisted into a local.
sum = 0;
watch.Restart();
int length = array.Length;
for (int i = 0; i < length; i++)
    sum += array[i];
Console.WriteLine("for loop fixed: " + watch.ElapsedMilliseconds + "ms, result: " + sum);

// foreach loop.
sum = 0;
watch.Restart();
foreach (int i in array)
{
    sum += i;
}
Console.WriteLine("foreach sum: " + watch.ElapsedMilliseconds + "ms, result: " + sum);

// PLINQ parallel Sum().
sum = 0;
watch.Restart();
sum = array.AsParallel().Sum();
Console.WriteLine("linq parallel sum: " + watch.ElapsedMilliseconds + "ms, result: " + sum);
LINQ Parallel seems to be fastest, on my machine at least.
Hoisting the length into a local doesn't matter much, but it does improve things by roughly 10%.
There is not much more you can actually do; unmanaged C code will always be faster for this.
Results on my PC are:
for loop: 241ms, result: 100000000
linq sum: 559ms, result: 100000000
for loop fixed: 237ms, result: 100000000
foreach sum: 295ms, result: 100000000
linq parallel sum: 205ms, result: 100000000