I'm interested in how efficient low-level algorithms can be in .net. I would like to enable us to choose to write more of our code in C# rather than C++ in the future, but one stumbling block is the bounds checking in .net that occurs with looping and random access to arrays.
A motivating example is a function that calculates the sum of products of corresponding elements in two arrays (this is the dot product of two vectors).
static void SumProduct(double[] X, double[] Y) { double sum = 0; int length = X.Length; if (length != Y.Length) throw new ArgumentException("X and Y must be same size"); for (int i = 0; i < length; i++) // Check X.Length instead? See below sum += X[i] * Y[i]; }
From what I can tell, and don't know enough IL or x86 to check, the compiler won't optimize out bounds checking of X
and Y
. Am I wrong and/or is there a way to write my code to allow the compiler to help me out?
Further details
There are many efficiency-arguments for and against using particular languages, not least that it is better to concentrate on "big O" algorithmic cost rather than the constant of proportionality, and higher level languages help you to do this. On the subject of bounds checking in .net, the best article I found is Array Bounds Check Elimination in the CLR on MSDN (also referenced in a stack overflow answer on the importance of enabling optimization).
This dates from 2009, so I wonder whether things have changed significantly since then. Also, the article reveals some real subtleties that would have caught me out so for this reason alone I would welcome some expert advice.
For example it appears that in my code above I would have better off writing i< X.Length
rather than i < length
. Also, I had also naively assumed that for an algorithm with a single array, writing a foreach
loop would better declare your intent to the compiler and give it the best chance of optimizing out the bounds checking.
According to the MSDN article, SumForBAD
, below, which I thought was sure to be optimized, would not be. Whereas SumFor
would be straightforwardly optimized, and SumForEach
would also be optimized, but not trivially (and might not be optimized at all if the array were passed into a function as IEnumerable<int>
)?
static double SumForBAD(double[] X) { double sum = 0; int length = X.Length; // better to use i < X.length in loop for (int i = 0; i < length; i++) sum += X[i]; return sum; } static double SumFor(double[] X) { double sum = 0; for (int i = 0; i < X.Length; i++) sum += X[i]; return sum; } static double SumForEach(double[] X) { double sum = 0; foreach (int element in X) sum += element; return sum; }
I did some investigation based on doug65536's answer. In C++, I compared the times of a SumProduct that does one bounds-check
for(int i=0; i<n; ++i) sum += v1[i]*v2[i];
against another version that does two bounds-checks
for(int i=0; i<n1 && i <n2; ++i) sum += v1[i]*v2[i];
I found that the second version was slower, but only by about 3.5% (Visual Studio 2010, optimized build, default options). However it occurred to me that in C#, there might be three bounds checks. One explicit (i < length
in the function static void SumProduct(double[] X, double[] Y)
at the start of this question), and two implicit (X[i]
and Y[i]
). So I tested a third C++ function, with three bounds checks
for(int i=0; i<n1 && i <n2 && i <n3; ++i) sum += v1[i]*v2[i];
This came in 35% slower than the first, which is worth caring about. I did some more investigation in this question, Why does adding extra check in loop make big difference on some machines, and small difference on others?. Interestingly, it seems as though the cost of bounds checking varies significantly on different machines.
Array bound checking refers to determining whether all array references in a program are within their declared ranges. This checking is critical for software verification and validation because subscripting arrays beyond their declared sizes may produce unexpected results, security holes, or failures.
In computer programming, bounds checking is any method of detecting whether a variable is within some bounds before it is used. It is usually used to ensure that a number fits into a given type (range checking), or that a variable being used as an array index is within the bounds of the array (index checking).
The obvious advantage of array bounds checking approaches are that they completely eliminate buffer overflow vulnerabilities. However, these are also the most expensive solution, particularly for pointer- and array-intensive programs since every pointer and array operation must be checked.
std::array provides many benefits over built-in arrays, such as preventing automatic decay into a pointer, maintaining the array size, providing bounds checking, and allowing the use of C++ container operations.
The bounds check won't matter because:
The bounds check consists of a cmp
/jae
instruction pair, which is fused into a single micro-op on modern CPU architectures (the term is "macro-op fusion"). Compare and branch is very highly optimized.
The bounds check is a forward branch, which will be statically predicted to be not-taken, also reducing the cost. The branch will never be taken. (If it ever is taken, an exception will throw anyway, so the mispredict cost becomes utterly irrelevant)
As soon as there is any memory delay, speculative execution will queue up many iterations of the loop, so the cost of decoding the extra instruction pair almost disappears.
Memory access will likely be your bottleneck, so the effect micro-optimizations like removing bounds checks will disappear.
The 64-bit jitter does a good job of eliminating bounds checks (at least in straightforward scenarios). I added return sum;
at the end of your method and then compiled the program using Visual Studio 2010 in Release mode. In the disassembly below (which I annotated with a C# translation), notice that:
X
, even though your code compares i
against length
instead of X.Length
. This is an improvement over the behavior described in the article.Y.Length >= X.Length
.Disassembly
; Register assignments: ; rcx := i ; rdx := X ; r8 := Y ; r9 := X.Length ("length" in your code, "XLength" below) ; r10 := Y.Length ("YLength" below) ; r11 := X.Length - 1 ("XLengthMinus1" below) ; xmm1 := sum ; (Prologue) 00000000 push rbx 00000001 push rdi 00000002 sub rsp,28h ; (Store arguments X and Y in rdx and r8) 00000006 mov r8,rdx ; Y 00000009 mov rdx,rcx ; X ; int XLength = X.Length; 0000000c mov r9,qword ptr [rdx+8] ; int XLengthMinus1 = XLength - 1; 00000010 movsxd rax,r9d 00000013 lea r11,[rax-1] ; int YLength = Y.Length; 00000017 mov r10,qword ptr [r8+8] ; if (XLength != YLength) ; throw new ArgumentException("X and Y must be same size"); 0000001b cmp r9d,r10d 0000001e jne 0000000000000060 ; double sum = 0; 00000020 xorpd xmm1,xmm1 ; if (XLength > 0) ; { 00000024 test r9d,r9d 00000027 jle 0000000000000054 ; int i = 0; 00000029 xor ecx,ecx 0000002b xor eax,eax ; if (XLengthMinus1 >= YLength) ; throw new IndexOutOfRangeException(); 0000002d cmp r11,r10 00000030 jae 0000000000000096 ; do ; { ; sum += X[i] * Y[i]; 00000032 movsd xmm0,mmword ptr [rdx+rax+10h] 00000038 mulsd xmm0,mmword ptr [r8+rax+10h] 0000003f addsd xmm0,xmm1 00000043 movapd xmm1,xmm0 ; i++; 00000047 inc ecx 00000049 add rax,8 ; } ; while (i < XLength); 0000004f cmp ecx,r9d 00000052 jl 0000000000000032 ; } ; return sum; 00000054 movapd xmm0,xmm1 ; (Epilogue) 00000058 add rsp,28h 0000005c pop rdi 0000005d pop rbx 0000005e ret 00000060 ... 00000096 ...
The 32-bit jitter, unfortunately, is not quite as smart. In the disassembly below, notice that:
X
, even though your code compares i
against length
instead of X.Length
. Again, this is an improvement over the behavior described in the article.Y
.Disassembly
; Register assignments: ; eax := i ; ecx := X ; edx := Y ; esi := X.Length ("length" in your code, "XLength" below) ; (Prologue) 00000000 push ebp 00000001 mov ebp,esp 00000003 push esi ; double sum = 0; 00000004 fldz ; int XLength = X.Length; 00000006 mov esi,dword ptr [ecx+4] ; if (XLength != Y.Length) ; throw new ArgumentException("X and Y must be same size"); 00000009 cmp dword ptr [edx+4],esi 0000000c je 00000012 0000000e fstp st(0) 00000010 jmp 0000002F ; int i = 0; 00000012 xor eax,eax ; if (XLength > 0) ; { 00000014 test esi,esi 00000016 jle 0000002C ; do ; { ; double temp = X[i]; 00000018 fld qword ptr [ecx+eax*8+8] ; if (i >= Y.Length) ; throw new IndexOutOfRangeException(); 0000001c cmp eax,dword ptr [edx+4] 0000001f jae 0000005A ; sum += temp * Y[i]; 00000021 fmul qword ptr [edx+eax*8+8] 00000025 faddp st(1),st ; i++; 00000027 inc eax ; while (i < XLength); 00000028 cmp eax,esi 0000002a jl 00000018 ; } ; return sum; 0000002c pop esi 0000002d pop ebp 0000002e ret 0000002f ... 0000005a ...
The jitter has improved since 2009, and the 64-bit jitter can generate more efficient code than the 32-bit jitter.
If necessary, though, you can always bypass array bounds checks completely by using unsafe code and pointers (as svick points out). This technique is used by some performance-critical code in the Base Class Library.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With