One of my co-workers has been reading Clean Code by Robert C Martin and got to the section about using many small functions as opposed to fewer large functions. This led to a debate about the performance consequence of this methodology. So we wrote a quick program to test the performance and are confused by the results.
For starters here is the normal version of the function.
static double NormalFunction()
{
double a = 0;
for (int j = 0; j < s_OuterLoopCount; ++j)
{
for (int i = 0; i < s_InnerLoopCount; ++i)
{
double b = i * 2;
a = a + b + 1;
}
}
return a;
}
Here is the version I made that breaks the functionality into small functions.
static double TinyFunctions()
{
double a = 0;
for (int i = 0; i < s_OuterLoopCount; i++)
{
a = Loop(a);
}
return a;
}
static double Loop(double a)
{
for (int i = 0; i < s_InnerLoopCount; i++)
{
double b = Double(i);
a = Add(a, Add(b, 1));
}
return a;
}
static double Double(double a)
{
return a * 2;
}
static double Add(double a, double b)
{
return a + b;
}
I use the stopwatch class to time the functions and when I ran it in debug I got the following results.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 377 ms;
TinyFunctions Time = 1322 ms;
These results make sense to me especially in debug as there is additional overhead in function calls. It is when I run it in release that I get the following results.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 173 ms;
TinyFunctions Time = 98 ms;
These results confuse me, even if the compiler was optimizing the TinyFunctions by in-lining all the function calls, how could that make it ~57% faster?
We have tried moving variable declarations around in NormalFunctions and it basically no effect on the run time.
I was hoping that someone would know what is going on and if the compiler can optimize TinyFunctions so well, why can't it apply similar optimizations to NormalFunction.
In looking around we found where someone mentioned that having the functions broken out allows the JIT to better optimize what to put in the registers, but NormalFunctions only has 4 variables so I find it hard to believe that explains the massive performance difference.
I'd be grateful for any insight someone can provide.
Update 1 As pointed out below by Kyle changing the order of operations made a massive difference in the performance of NormalFunction.
static double NormalFunction()
{
double a = 0;
for (int j = 0; j < s_OuterLoopCount; ++j)
{
for (int i = 0; i < s_InnerLoopCount; ++i)
{
double b = i * 2;
a = b + 1 + a;
}
}
return a;
}
Here are the results with this configuration.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 91 ms;
TinyFunctions Time = 102 ms;
This is more what I expected but still leaves the question as to why order of operations can have a ~56% performance hit.
Furthermore, I then tried it with integer operations and we are back to not making any sense.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 87 ms;
TinyFunctions Time = 52 ms;
And this doesn't change regardless of the order of operations.
C programming language is a machine-independent programming language that is mainly used to create many types of applications and operating systems such as Windows, and other complicated programs such as the Oracle database, Git, Python interpreter, and games and is considered a programming foundation in the process of ...
In the real sense it has no meaning or full form. It was developed by Dennis Ritchie and Ken Thompson at AT&T bell Lab. First, they used to call it as B language then later they made some improvement into it and renamed it as C and its superscript as C++ which was invented by Dr.
What is C? C is a general-purpose programming language created by Dennis Ritchie at the Bell Laboratories in 1972. It is a very popular language, despite being old. C is strongly associated with UNIX, as it was developed to write the UNIX operating system.
C is a general-purpose language that most programmers learn before moving on to more complex languages. From Unix and Windows to Tic Tac Toe and Photoshop, several of the most commonly used applications today have been built on C. It is easy to learn because: A simple syntax with only 32 keywords.
I can make performance match much better by changing one line of code:
a = a + b + 1;
Change it to:
a = b + 1 + a;
Or:
a += b + 1;
Now you'll find that NormalFunction
might actually be slightly faster and you can "fix" that by changing the signature of the Double
method to:
int Double( int a ) { return a * 2; }
I thought of these changes because this is what was different between the two implementations. After this, their performance is very similar with TinyFunctions
being a few percent slower (as expected).
The second change is easy to explain: the NormalFunction
implementation actually doubles an int
and then converts it to a double
(with an fild
opcode at the machine code level). The original Double
method loads a double
first and then doubles it, which I would expect to be slightly slower.
But that doesn't account for the bulk of the runtime discrepancy. That comes almost down entirely to that order change I made first. Why? I don't really have any idea. The difference in machine code looks like this:
Original Changed
01070620 push ebp 01390620 push ebp
01070621 mov ebp,esp 01390621 mov ebp,esp
01070623 push edi 01390623 push edi
01070624 push esi 01390624 push esi
01070625 push eax 01390625 push eax
01070626 fldz 01390626 fldz
01070628 xor esi,esi 01390628 xor esi,esi
0107062A mov edi,dword ptr ds:[0FE43ACh] 0139062A mov edi,dword ptr ds:[12243ACh]
01070630 test edi,edi 01390630 test edi,edi
01070632 jle 0107065A 01390632 jle 0139065A
01070634 xor edx,edx 01390634 xor edx,edx
01070636 mov ecx,dword ptr ds:[0FE43B0h] 01390636 mov ecx,dword ptr ds:[12243B0h]
0107063C test ecx,ecx 0139063C test ecx,ecx
0107063E jle 01070655 0139063E jle 01390655
01070640 mov eax,edx 01390640 mov eax,edx
01070642 add eax,eax 01390642 add eax,eax
01070644 mov dword ptr [ebp-0Ch],eax 01390644 mov dword ptr [ebp-0Ch],eax
01070647 fild dword ptr [ebp-0Ch] 01390647 fild dword ptr [ebp-0Ch]
0107064A faddp st(1),st 0139064A fld1
0107064C fld1 0139064C faddp st(1),st
0107064E faddp st(1),st 0139064E faddp st(1),st
01070650 inc edx 01390650 inc edx
01070651 cmp edx,ecx 01390651 cmp edx,ecx
01070653 jl 01070640 01390653 jl 01390640
01070655 inc esi 01390655 inc esi
01070656 cmp esi,edi 01390656 cmp esi,edi
01070658 jl 01070634 01390658 jl 01390634
0107065A pop ecx 0139065A pop ecx
0107065B pop esi 0139065B pop esi
0107065C pop edi 0139065C pop edi
0107065D pop ebp 0139065D pop ebp
0107065E ret 0139065E ret
Which is opcode-for-opcode identical except for the order of the floating point operations. That makes a huge performance difference but I don't know enough about x86 floating point operations to know why exactly.
With the new integer version we see something else curious. In this case it seems the JIT is trying to be clever and apply an optimization because it turns this:
int b = 2 * i;
a = a + b + 1;
Into something like:
mov esi, eax ; b = i
add esi, esi ; b += b
lea ecx, [ecx + esi + 1] ; a = a + b + 1
Where a
is stored in the ecx
register, i
in eax
, and b
in esi
.
Whereas the TinyFunctions
version gets turned into something like:
mov eax, edx
add eax, eax
inc eax
add ecx, eax
Where i
is in edx
, b
is in eax
, and a
is in ecx
this time around.
I suppose for our CPU architecture this LEA "trick" (explained here) ends up being slower than just using the ALU proper. It is still possible to change the code to get the performance between the two to line up:
int b = 2 * i + 1;
a += b;
This ends up forcing the NormalFunction
approach to end up getting turned into mov, add, inc, add
as it appears in the TinyFunctions
approach.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With