I was writing a few benchmarking tests to figure out why a similar pure algorithm (no C++ lib / .net built in classes) ran much faster in C++ than in C#, even when accounting for the expected feature différences. And while doing so i stumbled on these 2 tests that baffled me, does anyone have an idea about why one is substantially slower thant he other? The only difference in the 2nd one (that takes 51ms vs 88 on my machine) is that the 2 arrays are declared locally in the method instead of outside. In both cases the arrays are created before we start timing.
const int Runs = 100;
const int Width = 5000;
const int Height = 5000;
const int Size = Width * Height;
static int[] Input = Enumerable.Range(0, Size).ToArray();
static int[] Output = new int[Size * 2];
static int SimpleTest()
{
// Removing those 2 lines and using the static arrays instead give substantially slower performance, nearly half the speed!
int[] Input = Enumerable.Range(0, Size).ToArray();
int[] Output = new int[Size * 2];
Stopwatch sw = new Stopwatch();
sw.Start();
for (int run = 0; run < Runs; run++)
{
int InputIndex = 0;
for (int x = 0; x < Width; x++)
{
for (int y = 0; y < Height; y++)
{
int pixel = Input[InputIndex];
var OutputIndex = InputIndex * 2;
Output[OutputIndex] = pixel;
Output[OutputIndex + 1] = pixel;
InputIndex++;
}
}
}
sw.Stop();
return (int)(sw.ElapsedMilliseconds / Runs);
}
An array is faster and that is because ArrayList uses a fixed amount of array. However when you add an element to the ArrayList and it overflows. It creates a new Array and copies every element from the old one to the new one. List over arrays.
Difference Between Static Array and Dynamic ArrayStatic arrays are allocated memory at compile time. Dynamic array is located at run-time. The size of static array is fixed. The size of dynamic array is fixed.
Arrays can store data very compactly and are more efficient for storing large amounts of data. Arrays are great for numerical operations; lists cannot directly handle math operations. For example, you can divide each element of an array by the same number with just one line of code.
When the variables are local, the compiler knows that Input
and Output
will never change, which opens up many optimizations.
Input
and Output
variables can be kept in registers.Input.Length
and Output.Length
can be calculated once and cached.Input[InputIndex]
and Output[OutputIndex]
will never result in an array index out of bounds, so the bounds check can be optimized out.Input
and Output
are never used, so it could optimize the loops to nothing!If you use the static versions, then the compiler cannot perform these optimizations. The compiler must reload Input
and Output
at each access, and must perform a bounds check at every array index operation, just in case another thread modified Input
or Output
.
For example, if another thread does Input = new int[Size]
, then all future calculations must proceed with this alternate Input
. And if another thread does Output = new int[1]
, then the code must raise an IndexOutOfRangeException
.
With the 32-bit JIT I believe the culprit is, as Raymond Chen mentionned, that Input and Output can be kept in registers when they are locals, but they need to be reloaded every time when they are not. Generated assembly:
For locals:
007426F0 mov eax,dword ptr [ebp-18h]
007426F3 mov edi,dword ptr [eax+4]
int pixel = Input[InputIndex];
007426F6 mov eax,dword ptr [ebp-18h]
007426F9 cmp edx,edi
007426FB jae 0074276E
007426FD mov ecx,dword ptr [eax+edx*4+8]
For statics:
011C2718 mov dword ptr [ebp-18h],edx
011C271B mov esi,dword ptr ds:[3BB7E90h]
011C2721 mov eax,dword ptr [esi+4]
011C2724 mov dword ptr [ebp-1Ch],eax
int pixel = Input[InputIndex];
011C2727 mov eax,dword ptr [ebp-1Ch]
011C272A cmp ecx,eax
011C272C jae 011C27A2
011C272E mov edi,dword ptr [esi+ecx*4+8]
As you can see, mov esi,dword ptr ds:[3BB7E90h]
accesses the data segment.
As you can also see, bounds checking occur in both cases (cmp-jae
) so that's irrelevant, and the loops are not in fact optimized to nothing.
How the 64-bit JIT avoids the issue is beyond me.
Here's the complete dissassembly for both cases:
Fast version:
for (int x = 0; x < Width; x++) {
007426EB mov dword ptr [ebp-14h],edx
for (int y = 0; y < Height; y++) {
007426EE xor ebx,ebx
007426F0 mov eax,dword ptr [ebp-18h]
007426F3 mov edi,dword ptr [eax+4]
int pixel = Input[InputIndex];
007426F6 mov eax,dword ptr [ebp-18h]
007426F9 cmp edx,edi
007426FB jae 0074276E
007426FD mov ecx,dword ptr [eax+edx*4+8]
var OutputIndex = InputIndex * 2;
00742701 mov esi,edx
00742703 add esi,esi
Output[OutputIndex] = pixel;
00742705 mov eax,dword ptr [ebp-1Ch]
00742708 cmp esi,dword ptr [eax+4]
0074270B jae 0074276E
0074270D mov dword ptr [eax+esi*4+8],ecx
Output[OutputIndex + 1] = pixel;
00742711 inc esi
00742712 mov eax,dword ptr [ebp-1Ch]
00742715 cmp esi,dword ptr [eax+4]
00742718 jae 0074276E
0074271A mov dword ptr [eax+esi*4+8],ecx
InputIndex++;
0074271E inc edx
for (int y = 0; y < Height; y++) {
0074271F inc ebx
for (int y = 0; y < Height; y++) {
00742720 cmp ebx,1388h
00742726 jl 007426F6
for (int x = 0; x < Width; x++) {
00742728 inc dword ptr [ebp-14h]
0074272B cmp dword ptr [ebp-14h],1388h
00742732 jl 007426EE
Slow version:
for (int x = 0; x < Width; x++) {
011C2713 mov dword ptr [ebp-14h],ecx
for (int y = 0; y < Height; y++) {
011C2716 xor edx,edx
011C2718 mov dword ptr [ebp-18h],edx
011C271B mov esi,dword ptr ds:[3BB7E90h]
011C2721 mov eax,dword ptr [esi+4]
011C2724 mov dword ptr [ebp-1Ch],eax
int pixel = Input[InputIndex];
011C2727 mov eax,dword ptr [ebp-1Ch]
011C272A cmp ecx,eax
011C272C jae 011C27A2
011C272E mov edi,dword ptr [esi+ecx*4+8]
var OutputIndex = InputIndex * 2;
011C2732 mov ebx,ecx
011C2734 add ebx,ebx
Output[OutputIndex] = pixel;
011C2736 mov edx,dword ptr ds:[3BB7E94h]
011C273C cmp ebx,dword ptr [edx+4]
011C273F jae 011C27A2
011C2741 mov dword ptr [edx+ebx*4+8],edi
Output[OutputIndex + 1] = pixel;
011C2745 inc ebx
011C2746 cmp ebx,dword ptr [edx+4]
011C2749 jae 011C27A2
011C274B mov dword ptr [edx+ebx*4+8],edi
InputIndex++;
011C274F inc ecx
for (int y = 0; y < Height; y++) {
011C2750 inc dword ptr [ebp-18h]
011C2753 cmp dword ptr [ebp-18h],1388h
011C275A jl 011C2727
for (int x = 0; x < Width; x++) {
011C275C inc dword ptr [ebp-14h]
011C275F cmp dword ptr [ebp-14h],1388h
011C2766 jl 011C2716
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With