I've just written a small example checking, how C#'s optimizer behaves in case of indexers. The example is simple - I just wrap an array in a class and try to fill its values: once directly and once by indexer (which internally accesses the data exactly the same way as the direct solution does).
public class ArrayWrapper
{
public ArrayWrapper(int newWidth, int newHeight)
{
width = newWidth;
height = newHeight;
data = new int[width * height];
}
public int this[int x, int y]
{
get
{
return data[y * width + x];
}
set
{
data[y * width + x] = value;
}
}
public readonly int width, height;
public readonly int[] data;
}
public class Program
{
public static void Main(string[] args)
{
ArrayWrapper bigArray = new ArrayWrapper(15000, 15000);
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
for (int y = 0; y < bigArray.height; y++)
for (int x = 0; x < bigArray.width; x++)
bigArray.data[y * bigArray.width + x] = 12;
stopwatch.Stop();
Console.WriteLine(String.Format("Directly: {0} ms", stopwatch.ElapsedMilliseconds));
stopwatch.Restart();
for (int y = 0; y < bigArray.height; y++)
for (int x = 0; x < bigArray.width; x++)
bigArray[x, y] = 12;
stopwatch.Stop();
Console.WriteLine(String.Format("Via indexer: {0} ms", stopwatch.ElapsedMilliseconds));
Console.ReadKey();
}
}
Many SO posts taught me, that a programmer should highly trust optimizer to do its job. But in this case results are quite surprising:
Directly: 1282 ms
Via indexer: 2134 ms
(Compiled in Release configuration with the optimizations on, I double-checked).
That's a huge difference - no way being a statistical error (and it's both scalable and repeatable).
It's a very unpleasant surprise: in this case I'd expect the compiler to inline the indexer (it even does not include any range-checking), but it didn't do it. Here's the disassembly (note, that my comments are guesses on what is going on):
bigArray.data[y * bigArray.width + x] = 12;
000000a2 mov eax,dword ptr [ebp-3Ch] // Evaluate index of array
000000a5 mov eax,dword ptr [eax+4]
000000a8 mov edx,dword ptr [ebp-3Ch]
000000ab mov edx,dword ptr [edx+8]
000000ae imul edx,dword ptr [ebp-10h]
000000b2 add edx,dword ptr [ebp-14h] // ...until here
000000b5 cmp edx,dword ptr [eax+4] // Range checking
000000b8 jb 000000BF
000000ba call 6ED23CF5 // Throw IndexOutOfRange
000000bf mov dword ptr [eax+edx*4+8],0Ch // Assign value to array
bigArray[x, y] = 12;
0000015e push dword ptr [ebp-18h] // Push x and y
00000161 push 0Ch // (prepare parameters)
00000163 mov ecx,dword ptr [ebp-3Ch]
00000166 mov edx,dword ptr [ebp-1Ch]
00000169 cmp dword ptr [ecx],ecx
0000016b call dword ptr ds:[004B27DCh] // Call the indexer
(...)
data[y * width + x] = value;
00000000 push ebp
00000001 mov ebp,esp
00000003 sub esp,8
00000006 mov dword ptr [ebp-8],ecx
00000009 mov dword ptr [ebp-4],edx
0000000c cmp dword ptr ds:[004B171Ch],0 // Some additional checking, I guess?
00000013 je 0000001A
00000015 call 6ED24648
0000001a mov eax,dword ptr [ebp-8] // Evaluating index
0000001d mov eax,dword ptr [eax+4]
00000020 mov edx,dword ptr [ebp-8]
00000023 mov edx,dword ptr [edx+8]
00000026 imul edx,dword ptr [ebp+0Ch]
0000002a add edx,dword ptr [ebp-4] // ...until here
0000002d cmp edx,dword ptr [eax+4] // Range checking
00000030 jb 00000037
00000032 call 6ED23A5D // Throw IndexOutOfRange exception
00000037 mov ecx,dword ptr [ebp+8]
0000003a mov dword ptr [eax+edx*4+8],ecx // Actual assignment
}
0000003e nop
0000003f mov esp,ebp
00000041 pop ebp
00000042 ret 8 // Returning
That's a total disaster (in terms of code optimization). So my questions are:
Ok, I know, that the last one is hard to answer. But lately I read many questions about C++ performance and was amazed how much can optimizer do (for example, total inlining of std::tie
, two std::tuple
ctors and overloaded opeartor <
on the fly).
Edit: (in response to comments)
It seems, that actually that was still my fault, because I checked the performance while running the IDE. Now I ran the same program out of IDE and attached to it by debugger on-the-fly. Now I get:
bigArray.data[y * bigArray.width + x] = 12;
000000ae mov eax,dword ptr [ebp-10h]
000000b1 imul eax,edx
000000b4 add eax,ebx
000000b6 cmp eax,edi
000000b8 jae 000001FA
000000be mov dword ptr [ecx+eax*4+8],0Ch
bigArray[x, y] = 12;
0000016b mov eax,dword ptr [ebp-14h]
0000016e imul eax,edx
00000171 add eax,ebx
00000173 cmp eax,edi
00000175 jae 000001FA
0000017b mov dword ptr [ecx+eax*4+8],0Ch
These codes are exactly the same (in terms of CPU instructions). After running, the indexer version achieved even better results than direct one, but only (I guess) because of cache'ing. After putting the tests inside a loop, everything went back to normal:
Directly: 573 ms
Via indexer: 353 ms
Directly: 356 ms
Via indexer: 362 ms
Directly: 351 ms
Via indexer: 370 ms
Directly: 351 ms
Via indexer: 354 ms
Directly: 359 ms
Via indexer: 356 ms
Well; lesson learned. Even though compiling in Release mode, there is a huge difference, whether program is run in IDE or standalone. Thanks @harold for the idea.
Running code with the debugger immediately attached makes it generate slow code (unless you enable "Suppress JIT optimization on module load", but that makes debugging a little hard). The procedure I use to view the optimized assembly is to throw an exception (conditionally, say, if a static variable is 0, so the optimizer doesn't get too trigger-happy), and attach the debugger when it crashes. You'll probably have to go through the "Manually choose debuggers"-route. Also, make sure you enable "Show external code" (from the context menu on the call stack).
The code I got for the direct access was
innerloop:
mov eax,dword ptr [esi+8] ; bigArray.width
imul eax,ebx ; * y
add eax,edi ; + x
mov edx,dword ptr [ebp-14h] ; pointer to bigArray.data
mov ecx,dword ptr [ebp-10h] ; \
cmp eax,ecx ; | bounds check
jae 00000087 ; /
mov dword ptr [edx+eax*4+8],0Ch ; data[index] = 12
inc edi ; x++
cmp edi,dword ptr [esi+8] ; \
jl innerloop ; / if (x < bigArray.width) goto innerloop
And for the indexer:
innerloop:
mov eax,dword ptr [esi+8] ; bigArray.width
imul eax,ebx ; * y
add eax,edi ; + x
mov edx,dword ptr [ebp-14h] ; pointer to bigArray.data
mov ecx,dword ptr [ebp-10h] ; \
cmp eax,ecx ; | bounds check
jae 00000087 ; /
mov dword ptr [edx+eax*4+8],0Ch ; data[index] = 12
inc edi ; x++
cmp edi,dword ptr [esi+8] ; \
jl innerloop ; / if (x < bigArray.width) goto innerloop
This is not a paste-mistake, the code for the inner loop really was exactly the same.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With