I've recently implemented a QuickSort algorithm in C#. Sorting on an integer array containing millions of items, the code's performance is approximately 10% behind .NET's implementation.
private static void QS(int[] arr, int left, int right)
{
if (left >= right) return;
var pIndex = Partition(arr, left, right);
QS( arr, left, pIndex);
QS( arr, pIndex + 1, right);
}
On an array of 5 million items, this code is about 60ms slower than .NET's.
Subsequently, I created an another method that has the Partition()
method inlined into QS()
(eliminating the method call and the return
statement). This however resulted in a drop in perfomance to about 250ms slower than .NET's sort method.
Why does this happen?
Edit: This is the code the Partition()
method. In the inlined version of QS()
, the entire content of this method with the exception of the return
statement replaces the var pIndex = Partition(arr, left, right);
line.
private static int Partition(int[] arr, int left, int right)
{
int pivot = arr[left];
int leftPoint = left - 1;
int pIndex = right + 1;
int temp = 0;
while (true)
{
do { pIndex--; } while (arr[pIndex] > pivot);
do { leftPoint++; } while (arr[leftPoint] < pivot);
if (leftPoint < pIndex)
{
temp = arr[leftPoint];
arr[leftPoint] = arr[pIndex];
arr[pIndex] = temp;
}
else { break; }
}
return pIndex;
}
Edit #2: In case anyone's interested in compiling, here's the code that calls the algorithms:
Edit #3: New test code from Haymo's suggestion.
private static void Main(string[] args)
{
const int globalRuns = 10;
const int localRuns = 1000;
var source = Enumerable.Range(1, 200000).OrderBy(n => Guid.NewGuid()).ToArray();
var a = new int[source.Length];
int start, end, total;
for (int z = 0; z < globalRuns; z++)
{
Console.WriteLine("Run #{0}", z+1);
total = 0;
for (int i = 0; i < localRuns; i++)
{
Array.Copy(source, a, source.Length);
start = Environment.TickCount;
Array.Sort(a);
end = Environment.TickCount;
total += end - start;
}
Console.WriteLine("{0}\t\tTtl: {1}ms\tAvg: {2}ms", ".NET", total, total / localRuns);
total = 0;
for (int i = 0; i < localRuns; i++)
{
Array.Copy(source, a, source.Length);
start = Environment.TickCount;
Quicksort.SortInline(a);
end = Environment.TickCount;
total += end - start;
}
Console.WriteLine("{0}\t\tTtl: {1}ms\tAvg: {2}ms", "Inlined", total, total / localRuns);
total = 0;
for (int i = 0; i < localRuns; i++)
{
Array.Copy(source, a, source.Length);
start = Environment.TickCount;
Quicksort.SortNonInline(a);
end = Environment.TickCount;
total += end - start;
}
Console.WriteLine("{0}\tTtl: {1}ms\tAvg: {2}ms\n", "Not inlined", total, total / localRuns);
}
}
Based on the information given in the question, one can only guess and name some ideas.
Did you measure correctly? Keep in mind, in order to get reliable performance results, one should (at least)
In order to make sure, the source of the (assumed) performance drop is really connected to function inlining one can investigate the generated IL code. Or even better: the machine instructions generated by the JIT compiler.
For ILNumerics we implemented a custom quick sort and did a lot of performance measures. The final algorithm is several times faster then the CLR version. Manual inlining is only one improvement, which was neccessary for better performance. Others are:
Very often, the source for strange performance results is found in the way, memory is (mis)used by an algorithm. Another might lay in the different instruction flow which eventually more or less succeed in getting optimized by any compiler/processor involved. The overall execution performance is a highly complex beast, hard to guess deterministically and hence a profiler is your best friend!
@Edit: By looking at your main test routine, it appears, you are mainly measuring the memory bandwidth of your processor/main memory. A int[] array of length 5*10e6 is roughly 19 MB in size. This most probably is out of range of your caches. So the processor will wait for memory due to compulsory cache misses most the time. This makes a measure of the influence of any code reformulation hard to guess. I suggest to try to measure this way instead: (pseudo code)
iterate over number of global repetitions (let say 10)
average time for Array.Sort
inner repetitions for Quicksort.Sort (eg. 1000)
average time for Quicksort.Sort
inner repetitions for Quicksort.Sort2 (eg. 1000)
The goal is to make the quicksort only use data from the caches. Therefore, make sure not to recreate the copies from new memory but rather have only two global instances: the original and the copy to get sorted. Both must fit into your (last level) cache at the same time! With some headroom (for other processes on the system) a good guess is to use up only half of the available last level cache size for both arrays. Depending on your true cache size, a test length of 250k seems more reasonable.
@Edit2: I ran your code, got the same results and watched the (optimized) machine instructions in the VS debugger. Here comes the relevant part from both versions:
Not inlined:
69: do { pIndex--; } while (arr[pIndex] > pivot);
00000017 dec ebx
00000018 cmp ebx,esi
0000001a jae 00000053
0000001c cmp dword ptr [ecx+ebx*4+8],edi
00000020 jg 00000017
70: do { leftPoint++; } while (arr[leftPoint] < pivot);
00000022 inc edx
00000023 cmp edx,esi
00000025 jae 00000053
00000027 cmp dword ptr [ecx+edx*4+8],edi
0000002b jl 00000022
Inlined:
97: do { pIndex--; } while (arr[pIndex] > pivot);
00000038 dec dword ptr [ebp-14h]
0000003b mov eax,dword ptr [ebp-14h]
0000003e cmp eax,edi
00000040 jae 00000097
00000042 cmp dword ptr [esi+eax*4+8],ebx
00000046 jg 00000038
98: do { leftPoint++; } while (arr[leftPoint] < pivot);
00000048 inc ecx
00000049 cmp ecx,edi
0000004b jae 00000097
0000004d cmp dword ptr [esi+ecx*4+8],ebx
00000051 jl 00000048
As can be seen, the 'Not inlined' version does better utilize the registers for decrementing the running index (Line 69 / 97). Obviously the JIT decided not to push and pop the corresponding register on the stack, because of other code in the same function is making use of the same register. Since this is a hot loop (and the CLR does not recognize this) the overall execution speed suffers. So, in this specific case, manually inlining the Partition function is not profitable.
However, as you know, there is no guarantee that other versions of the CLR do the same. Differences may even arise for 64 bit.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With