Unexplainable timings for different optimisations

I'm writing some code which has to apply one of several algorithms to a large data set, depending upon a setting. The data set is large, and real-world timings have indicated that we need to optimise this if possible.

The selected algorithm has to be run on many subsets of data from a large array. I therefore decided to try several different approaches:

  1. Initialise a Func<> delegate to reference the required algorithm. Call this delegate from within the main loop.
  2. Loop over the data and call the appropriate algorithm from within the main loop.
  3. Call a separate method which implements the main loop itself for each algorithm.

In my tests, I had each approach calling the same underlying method, calculate(). (Of course, the real code calls a different method for each algorithm, but here I am testing the quickest way of calling the algorithm, not the algorithm itself.)

Each of the tests calls the required algorithm in a loop, ITERS times.

In this test code, DataReductionAlgorithm is just an enum that defines the various algorithms. It's not really used other than to simulate what would happen in real code.

Here is my test implementation for approach (1). It's very straightforward: assign the algorithm to be called to the Func<> a, then call it from a loop:

private static void test1(int[] data, DataReductionAlgorithm algorithm)
{
    Func<int[], int, int, int> a;

    switch (algorithm)
    {
        case DataReductionAlgorithm.Max:
            a = calculate;
            break;

        case DataReductionAlgorithm.Mean:
            a = calculate;
            break;

        default:
            a = calculate;
            break;
    }

    for (int i = 0; i < ITERS; ++i)
        a(data, 0, data.Length);
}

Here's my test implementation for approach (2). It moves the if test for the choice of algorithm outside the loop. I was expecting this to be the fastest approach:

private static void test2(int[] data, DataReductionAlgorithm algorithm)
{
    switch (algorithm)
    {
        case DataReductionAlgorithm.Max:

            for (int i = 0; i < ITERS; ++i)
                calculate(data, 0, data.Length);

            break;

        case DataReductionAlgorithm.Mean:

            for (int i = 0; i < ITERS; ++i)
                calculate(data, 0, data.Length);

            break;

        default:

            for (int i = 0; i < ITERS; ++i)
                calculate(data, 0, data.Length);

            break;
    }
}

Here's the code for testing approach (3). It moves the if test for the choice of algorithm inside the loop. I was expecting this to be slower than approach (2) because the if test is going to be performed ITERS times, instead of just once:

private static void test3(int[] data, DataReductionAlgorithm algorithm)
{
    for (int i = 0; i < ITERS; ++i)
    {
        switch (algorithm)
        {
            case DataReductionAlgorithm.Max:

                calculate(data, 0, data.Length);
                break;

            case DataReductionAlgorithm.Mean:

                calculate(data, 0, data.Length);
                break;

            default:

                calculate(data, 0, data.Length);
                break;
        }
    }
}

Because of the strange timing results I was getting, I added a new test which is almost identical to test2() except that instead of looping within the switch cases, I call a method to do exactly the same loop.

I therefore expected this to take almost the same time as test2():

private static void test4(int[] data, DataReductionAlgorithm algorithm)
{
    switch (algorithm)
    {
        case DataReductionAlgorithm.Max:

            iterate(ITERS, data);
            break;

        case DataReductionAlgorithm.Mean:

            iterate(ITERS, data);
            break;

        default:

            iterate(ITERS, data);
            break;
    }
}

private static void iterate(int n, int[] data)
{
    for (int i = 0; i < n; ++i)
        calculate(data, 0, data.Length);
}

Here is the entire program, in case anyone wants to try it themselves:

using System;
using System.Diagnostics;
using System.Linq;

namespace Demo
{
    public enum DataReductionAlgorithm
    {
        Single,
        Max,
        Mean
    }

    internal class Program
    {
        private const int ITERS = 100000;

        private void run()
        {
            int[] data = Enumerable.Range(0, 10000).ToArray();

            Stopwatch sw = new Stopwatch();

            for (int trial = 0; trial < 4; ++trial)
            {
                sw.Restart();
                test1(data, DataReductionAlgorithm.Mean);
                Console.WriteLine("test1: " + sw.Elapsed);

                sw.Restart();
                test2(data, DataReductionAlgorithm.Mean);
                Console.WriteLine("test2: " + sw.Elapsed);

                sw.Restart();
                test3(data, DataReductionAlgorithm.Mean);
                Console.WriteLine("test3: " + sw.Elapsed);

                sw.Restart();
                test4(data, DataReductionAlgorithm.Mean);
                Console.WriteLine("test4: " + sw.Elapsed);

                Console.WriteLine();
            }
        }

        private static void test1(int[] data, DataReductionAlgorithm algorithm)
        {
            Func<int[], int, int, int> a;

            switch (algorithm)
            {
                case DataReductionAlgorithm.Max:
                    a = calculate;
                    break;

                case DataReductionAlgorithm.Mean:
                    a = calculate;
                    break;

                default:
                    a = calculate;
                    break;
            }

            for (int i = 0; i < ITERS; ++i)
                a(data, 0, data.Length);
        }

        private static void test2(int[] data, DataReductionAlgorithm algorithm)
        {
            switch (algorithm)
            {
                case DataReductionAlgorithm.Max:

                    for (int i = 0; i < ITERS; ++i)
                        calculate(data, 0, data.Length);

                    break;

                case DataReductionAlgorithm.Mean:

                    for (int i = 0; i < ITERS; ++i)
                        calculate(data, 0, data.Length);

                    break;

                default:

                    for (int i = 0; i < ITERS; ++i)
                        calculate(data, 0, data.Length);

                    break;
            }
        }

        private static void test3(int[] data, DataReductionAlgorithm algorithm)
        {
            for (int i = 0; i < ITERS; ++i)
            {
                switch (algorithm)
                {
                    case DataReductionAlgorithm.Max:

                        calculate(data, 0, data.Length);
                        break;

                    case DataReductionAlgorithm.Mean:

                        calculate(data, 0, data.Length);
                        break;

                    default:

                        calculate(data, 0, data.Length);
                        break;
                }
            }
        }

        private static void test4(int[] data, DataReductionAlgorithm algorithm)
        {
            switch (algorithm)
            {
                case DataReductionAlgorithm.Max:

                    iterate(ITERS, data);
                    break;

                case DataReductionAlgorithm.Mean:

                    iterate(ITERS, data);
                    break;

                default:

                    iterate(ITERS, data);
                    break;
            }
        }

        private static void iterate(int n, int[] data)
        {
            for (int i = 0; i < n; ++i)
                calculate(data, 0, data.Length);
        }

        private static int calculate(int[] data, int i1, int i2)
        {
            // Just a dummy implementation.
            // Using the same algorithm for each approach to avoid differences in timings.

            int result = 0;

            for (int i = i1; i < i2; ++i)
                result += data[i];

            return result;
        }

        private static void Main()
        {
            new Program().run();
        }
    }
}

The results

Firstly, note that these results are from a RELEASE BUILD run from outside the debugger. Running a debug build - or running a release build from the debugger - will give misleading results.
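As a quick sanity check (my own addition, not part of the timing code above), you can guard against accidentally benchmarking under a debugger or in a debug build:

```csharp
// Warn if the conditions that give misleading timings are present.
if (System.Diagnostics.Debugger.IsAttached)
    Console.WriteLine("Warning: debugger attached - timings will be misleading.");
#if DEBUG
    Console.WriteLine("Warning: DEBUG build - timings will be misleading.");
#endif
```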

I'm testing with .NET 4.5.1 on Windows 8.1 with a quad-core Intel processor. (However, I got similar results with .NET 4.5 and .NET 4.)

I got different results depending on whether it was x64/AnyCPU or x86.

To recap: I was expecting test1() and test3() to be the slowest, and test2() to be the fastest, with test4() being almost the same speed as test2().

Here's the x86 results:

test1: 00:00:00.5892166
test2: 00:00:00.5848795
test3: 00:00:00.5866006
test4: 00:00:00.5867143

That's kind of what I expected, except that test1() was faster than I thought it would be (possibly indicating that calling a delegate is highly optimised).

Here's the x64 results:

test1: 00:00:00.8769743
test2: 00:00:00.8750667
test3: 00:00:00.5839475
test4: 00:00:00.5853400

Whoah!

What happened to test1() and test2()? I can't explain that. How can test2() be so much slower than test3()?

And why isn't test4() pretty much the same speed as test2()?

And why the huge difference between x86 and x64?

Can anyone shed any light on this? The difference in speed is not insignificant - it might make the difference between something taking 10 seconds and it taking 15 seconds.


ADDENDUM

I've accepted the answer below.

However, just to illustrate the fragility of JIT optimisations mentioned by @usr below, consider the following code:

using System;
using System.Diagnostics;

namespace Demo
{
    internal class Program
    {
        private const int ITERS = 10000;

        private void run()
        {
            Stopwatch sw = new Stopwatch();
            int[] data = new int[10000];

            for (int trial = 0; trial < 4; ++trial)
            {
                sw.Restart();
                test1(data, 0);
                var elapsed1 = sw.Elapsed;

                sw.Restart();
                test2(data, 0);
                var elapsed2 = sw.Elapsed;

                Console.WriteLine("Ratio = " + elapsed1.TotalMilliseconds / elapsed2.TotalMilliseconds);
            }

            Console.ReadLine();
        }

        private static void test1(int[] data, int x)
        {
            switch (x)
            {
                case 0:
                {
                    for (int i = 0; i < ITERS; ++i)
                        dummy(data);

                    break;
                }
            }
        }

        private static void test2(int[] data, int x)
        {
            switch (x)
            {
                case 0:
                {
                    loop(data);
                    break;
                }
            }
        }

        private static int dummy(int[] data)
        {
            int max = 0;

            // Also try with "int i = 1" in the loop below.

            for (int i = 0; i < data.Length; ++i)
                if (data[i] > max)
                    max = data[i];

            return max;
        }

        private static void loop(int[] data)
        {
            for (int i = 0; i < ITERS; ++i)
                dummy(data);
        }

        private static void Main()
        {
            new Program().run();
        }
    }
}

Note the line of code underneath the comment // Also try with "int i = 1" in the loop below.

With i = 0, I get the following results for release x64 build:

Ratio = 1.52235829774506
Ratio = 1.50636405328076
Ratio = 1.52291602053827
Ratio = 1.52803278744701

Merely changing that to i = 1, I get these results:

Ratio = 1.16920209593233
Ratio = 0.990370350435142
Ratio = 0.991150637472754
Ratio = 0.999941245001628

Interesting! :)

Asked Feb 11 '26 by Matthew Watson

1 Answer

I can reproduce the issue on x64, .NET 4.5, Release, no Debugger.

I have looked at the generated x64 code for test2 and test3. The hot inner loop consumes 99% of the time; only this loop matters.

For test3, calculate is inlined and the loop bounds are equal to the array bounds. This allows the JIT to eliminate the range check. In test2 the range check could not be eliminated because the loop bounds are dynamic: they are given by the parameters int i1 and int i2, which are not statically known to be valid array bounds. Only inlining can provide that information in the current JIT. Inlining replaces these values with 0 and data.Length.

This does not have to be so: the HotSpot JVM performs dynamic range-check elimination, but the .NET JIT is not sophisticated enough to do this.
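To illustrate the pattern in question (a sketch of my own, not code from the post): the JIT can elide the per-element bounds check only when the loop limits are provably the array's own bounds.

```csharp
// Range check CAN be eliminated: the JIT sees the loop bound is
// data.Length, so data[i] is provably in range on every iteration.
private static int SumAll(int[] data)
{
    int result = 0;
    for (int i = 0; i < data.Length; ++i)
        result += data[i];   // no per-access bounds check
    return result;
}

// Range check canNOT be eliminated without inlining: i1 and i2 are
// arbitrary ints, so each data[i] must be checked at runtime.
private static int SumRange(int[] data, int i1, int i2)
{
    int result = 0;
    for (int i = i1; i < i2; ++i)
        result += data[i];   // bounds check on every iteration
    return result;
}
```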

test3 with inlining:

(disassembly screenshot not reproduced here)

test2 with calculate not inlined:

(disassembly screenshot not reproduced here)

Two branches instead of one: one is the loop test, the other the range check.

I have no idea why the JIT inlined differently here. Inlining is driven by heuristics.
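One way to take the heuristics out of the equation when experimenting (a suggestion of mine, not part of the original answer) is to pin the inlining decision explicitly with MethodImplAttribute; AggressiveInlining is available from .NET 4.5 and is only a hint, while NoInlining is a hard guarantee:

```csharp
using System.Runtime.CompilerServices;

// Hint to the JIT to inline this method if at all possible.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static int calculateInlined(int[] data, int i1, int i2)
{
    int result = 0;
    for (int i = i1; i < i2; ++i)
        result += data[i];
    return result;
}

// Forbid inlining, forcing the dynamic-bounds (range-checked) path.
[MethodImpl(MethodImplOptions.NoInlining)]
private static int calculateNotInlined(int[] data, int i1, int i2)
{
    int result = 0;
    for (int i = i1; i < i2; ++i)
        result += data[i];
    return result;
}
```

Timing the two variants from the same call site should show whether inlining alone accounts for the difference between test2 and test3.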

Answered Feb 13 '26 by usr


