
Cost of inlining methods in C#

I've recently implemented a QuickSort algorithm in C#. When sorting an integer array containing millions of items, the code's performance is approximately 10% behind .NET's implementation.

private static void QS(int[] arr, int left, int right)
{
    if (left >= right) return;

    var pIndex = Partition(arr, left, right);
    QS( arr, left, pIndex);
    QS( arr, pIndex + 1, right);
}

On an array of 5 million items, this code is about 60ms slower than .NET's.

Subsequently, I created another method with the Partition() method inlined into QS() (eliminating the method call and the return statement). This, however, resulted in a drop in performance to about 250ms slower than .NET's sort method.

Why does this happen?

Edit: This is the code for the Partition() method. In the inlined version of QS(), the entire content of this method, with the exception of the return statement, replaces the var pIndex = Partition(arr, left, right); line.

private static int Partition(int[] arr, int left, int right)
{
    int pivot = arr[left];
    int leftPoint = left - 1;
    int pIndex = right + 1;
    int temp = 0;

    while (true)
    {
        do { pIndex--; } while (arr[pIndex] > pivot);
        do { leftPoint++; } while (arr[leftPoint] < pivot);

        if (leftPoint < pIndex)
        {
            temp = arr[leftPoint];
            arr[leftPoint] = arr[pIndex];
            arr[pIndex] = temp;
        }
        else { break; }
    }
    return pIndex;
}
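
For reference, the manually inlined version is presumably something like the following (reconstructed from the description above; the name QSInline is just a placeholder, since the question doesn't show the exact code):

private static void QSInline(int[] arr, int left, int right)
{
    if (left >= right) return;

    // Body of Partition() pasted in place of the call; pIndex is now a local.
    int pivot = arr[left];
    int leftPoint = left - 1;
    int pIndex = right + 1;
    int temp = 0;

    while (true)
    {
        do { pIndex--; } while (arr[pIndex] > pivot);
        do { leftPoint++; } while (arr[leftPoint] < pivot);

        if (leftPoint < pIndex)
        {
            temp = arr[leftPoint];
            arr[leftPoint] = arr[pIndex];
            arr[pIndex] = temp;
        }
        else { break; }
    }

    QSInline(arr, left, pIndex);
    QSInline(arr, pIndex + 1, right);
}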

Edit #2: In case anyone's interested in compiling, here's the code that calls the algorithms.

Edit #3: New test code, following Haymo's suggestion below:

private static void Main(string[] args)
{
    const int globalRuns = 10;
    const int localRuns = 1000;

    var source = Enumerable.Range(1, 200000).OrderBy(n => Guid.NewGuid()).ToArray();
    var a = new int[source.Length];

    int start, end, total;

    for (int z = 0; z < globalRuns; z++)
    {
        Console.WriteLine("Run #{0}", z+1);

        total = 0;
        for (int i = 0; i < localRuns; i++)
        {
            Array.Copy(source, a, source.Length);
            start = Environment.TickCount;
            Array.Sort(a);
            end = Environment.TickCount;
            total += end - start;
        }
        Console.WriteLine("{0}\t\tTtl: {1}ms\tAvg: {2}ms", ".NET", total, total / localRuns);

        total = 0;
        for (int i = 0; i < localRuns; i++)
        {
            Array.Copy(source, a, source.Length);
            start = Environment.TickCount;
            Quicksort.SortInline(a);
            end = Environment.TickCount;
            total += end - start;
        }
        Console.WriteLine("{0}\t\tTtl: {1}ms\tAvg: {2}ms", "Inlined", total, total / localRuns);

        total = 0;
        for (int i = 0; i < localRuns; i++)
        {
            Array.Copy(source, a, source.Length);
            start = Environment.TickCount;
            Quicksort.SortNonInline(a);
            end = Environment.TickCount;
            total += end - start;
        }
        Console.WriteLine("{0}\tTtl: {1}ms\tAvg: {2}ms\n", "Not inlined", total, total / localRuns);
    }
}
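
The Quicksort wrapper class itself isn't shown; for anyone who wants to compile the test, it is presumably roughly this (the method names are taken from the calls in Main, and QSInline refers to the inlined variant sketched earlier):

public static class Quicksort
{
    // QS, QSInline and Partition from above live in this class as private members.

    public static void SortNonInline(int[] arr)
    {
        QS(arr, 0, arr.Length - 1);
    }

    public static void SortInline(int[] arr)
    {
        QSInline(arr, 0, arr.Length - 1);
    }
}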

Asked Mar 04 '12 by Raintree

1 Answer

Based on the information given in the question, one can only guess and offer some ideas.

Did you measure correctly? Keep in mind that in order to get reliable performance results, one should (at least) do the following (a minimal timing harness is sketched right after this list):

  • Run release builds without a debugger attached
  • Repeat the test sufficiently often and average the results
  • Make the test fair, i.e. give all test subjects the same 'resource configuration'
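
A minimal sketch of such a harness, assuming the sort routines can be passed as Action<int[]> (Stopwatch has much finer resolution than Environment.TickCount, and the warm-up call keeps JIT compilation out of the measurement):

private static double MeasureAverageMs(Action<int[]> sort, int[] source, int runs)
{
    var buffer = new int[source.Length];
    var sw = new System.Diagnostics.Stopwatch();
    double totalMs = 0;

    // Warm-up: make sure the method is JIT-compiled before timing starts.
    Array.Copy(source, buffer, source.Length);
    sort(buffer);

    for (int i = 0; i < runs; i++)
    {
        Array.Copy(source, buffer, source.Length);   // same unsorted input every run
        sw.Restart();
        sort(buffer);
        sw.Stop();
        totalMs += sw.Elapsed.TotalMilliseconds;
    }
    return totalMs / runs;
}

// Usage:
//   MeasureAverageMs(Array.Sort, source, 1000);
//   MeasureAverageMs(Quicksort.SortInline, source, 1000);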

To make sure the source of the (assumed) performance drop is really connected to function inlining, one can investigate the generated IL code, or even better, the machine instructions generated by the JIT compiler (for example in the Visual Studio Disassembly window, as done in Edit2 below).

For ILNumerics we implemented a custom quicksort and did a lot of performance measurements. The final algorithm is several times faster than the CLR version. Manual inlining is only one improvement that was necessary for better performance. Others are (two of them are combined in the sketch after this list):

  • Not using recursion
  • Using unsafe comparison / swapping
  • Using insertion sort for smaller partitions
  • Optimizing for the limited size of the temporary (stack replacement) array
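
A minimal sketch combining two of those ideas, an explicit stack instead of recursion plus an insertion-sort cutoff for small partitions (this is not ILNumerics' actual code; the threshold of 16 and the method names are assumptions, and Partition is the Hoare-style routine from the question):

private const int InsertionThreshold = 16;   // assumed cutoff, tune by measurement

public static void SortIterative(int[] arr)
{
    if (arr == null || arr.Length < 2) return;

    // Pairs of bounds pushed onto an explicit stack replace the recursive calls.
    var stack = new System.Collections.Generic.Stack<int>();
    stack.Push(0);
    stack.Push(arr.Length - 1);

    while (stack.Count > 0)
    {
        int right = stack.Pop();
        int left = stack.Pop();

        // Small partitions: insertion sort is cheaper than further partitioning.
        if (right - left < InsertionThreshold)
        {
            InsertionSort(arr, left, right);
            continue;
        }

        int p = Partition(arr, left, right);   // Hoare partition from the question
        stack.Push(left);  stack.Push(p);
        stack.Push(p + 1); stack.Push(right);
    }
}

private static void InsertionSort(int[] arr, int left, int right)
{
    for (int i = left + 1; i <= right; i++)
    {
        int key = arr[i];
        int j = i - 1;
        while (j >= left && arr[j] > key)
        {
            arr[j + 1] = arr[j];
            j--;
        }
        arr[j + 1] = key;
    }
}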

Very often, the source of strange performance results is found in the way memory is (mis)used by an algorithm. Another may lie in the different instruction flow, which gets optimized more or less successfully by the compilers and processor involved. Overall execution performance is a highly complex beast, hard to predict deterministically, and hence a profiler is your best friend!

@Edit: By looking at your main test routine, it appears you are mainly measuring the memory bandwidth of your processor/main memory. An int[] array of 5 million elements is roughly 19 MB in size (5,000,000 × 4 bytes), which is most probably out of range of your caches. So the processor will wait for memory due to compulsory cache misses most of the time, which makes it hard to tell how much influence any code reformulation has. I suggest measuring this way instead (pseudocode):

  • generate test data
  • allocate an array for the copies
  • iterate over the number of global repetitions (say 10)
    • inner repetitions for Array.Sort (e.g. 1000)
      • copy the (unsorted) test data
      • sort the copy by Array.Sort
      • add the time
    • average the time for Array.Sort
    • inner repetitions for Quicksort.Sort (e.g. 1000)
      • copy the (unsorted) test data
      • sort the copy by Quicksort.Sort
      • add the time
    • average the time for Quicksort.Sort
    • inner repetitions for Quicksort.Sort2 (e.g. 1000)
      • copy the (unsorted) test data
      • sort the copy by Quicksort.Sort2
      • add the time
    • average the time for Quicksort.Sort2

The goal is to make the quicksort use only data from the caches. Therefore, make sure not to recreate the copies from new memory, but rather keep only two global instances: the original and the copy to be sorted. Both must fit into your (last-level) cache at the same time! With some headroom (for other processes on the system), a good guess is to use only half of the available last-level cache size for both arrays. Depending on your true cache size, a test length of 250k elements seems more reasonable (2 × 250,000 × 4 bytes ≈ 2 MB).

@Edit2: I ran your code, got the same results, and looked at the (optimized) machine instructions in the VS debugger. Here is the relevant part of both versions:

Not inlined: 
    69:                 do { pIndex--; } while (arr[pIndex] > pivot);
00000017  dec         ebx 
00000018  cmp         ebx,esi 
0000001a  jae         00000053 
0000001c  cmp         dword ptr [ecx+ebx*4+8],edi 
00000020  jg          00000017 
    70:                 do { leftPoint++; } while (arr[leftPoint] < pivot);
00000022  inc         edx 
00000023  cmp         edx,esi 
00000025  jae         00000053 
00000027  cmp         dword ptr [ecx+edx*4+8],edi 
0000002b  jl          00000022 


Inlined: 
    97:                 do { pIndex--; } while (arr[pIndex] > pivot);
00000038  dec         dword ptr [ebp-14h] 
0000003b  mov         eax,dword ptr [ebp-14h] 
0000003e  cmp         eax,edi 
00000040  jae         00000097 
00000042  cmp         dword ptr [esi+eax*4+8],ebx 
00000046  jg          00000038 
    98:                 do { leftPoint++; } while (arr[leftPoint] < pivot);
00000048  inc         ecx 
00000049  cmp         ecx,edi 
0000004b  jae         00000097 
0000004d  cmp         dword ptr [esi+ecx*4+8],ebx 
00000051  jl          00000048 

As can be seen, the 'Not inlined' version makes better use of the registers for decrementing the running index (source line 69 vs. line 97 in the listings above): in the inlined version the JIT keeps pIndex on the stack (dword ptr [ebp-14h]) instead of in a register, because other code in the now larger function is using that register. Since this is a hot loop (and the CLR does not recognize it as such), the overall execution speed suffers. So, in this specific case, manually inlining the Partition function is not profitable.

However, as you know, there is no guarantee that other versions of the CLR will do the same. Differences may even arise on 64 bit.
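
As an alternative to duplicating code by hand, the JIT can be given an inlining hint via System.Runtime.CompilerServices.MethodImplAttribute (AggressiveInlining requires .NET 4.5 or later; NoInlining forces a real call, which is handy for comparison). Whether this actually pays off is, again, something only a measurement can tell:

using System.Runtime.CompilerServices;   // at the top of the file

// Hint to the JIT that Partition should be considered for inlining at its call
// sites; the method stays a member of the Quicksort class as before.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static int Partition(int[] arr, int left, int right)
{
    int pivot = arr[left];
    int leftPoint = left - 1;
    int pIndex = right + 1;

    while (true)
    {
        do { pIndex--; } while (arr[pIndex] > pivot);
        do { leftPoint++; } while (arr[leftPoint] < pivot);

        if (leftPoint >= pIndex) return pIndex;

        int temp = arr[leftPoint];
        arr[leftPoint] = arr[pIndex];
        arr[pIndex] = temp;
    }
}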

Answered by Haymo Kutschbach