C# Performance on Small Functions

Tags:

One of my co-workers has been reading Clean Code by Robert C Martin and got to the section about using many small functions as opposed to fewer large functions. This led to a debate about the performance consequence of this methodology. So we wrote a quick program to test the performance and are confused by the results.

For starters here is the normal version of the function.

static double NormalFunction()
{
    double a = 0;
    for (int j = 0; j < s_OuterLoopCount; ++j)
    {
        for (int i = 0; i < s_InnerLoopCount; ++i)
        {
            double b = i * 2;
            a = a + b + 1;
        }
    }
    return a;
}

Here is the version I made that breaks the functionality into small functions.

static double TinyFunctions()
{
    double a = 0;
    for (int i = 0; i < s_OuterLoopCount; i++)
    {
        a = Loop(a);
    }
    return a;
}
static double Loop(double a)
{
    for (int i = 0; i < s_InnerLoopCount; i++)
    {
        double b = Double(i);
        a = Add(a, Add(b, 1));
    }
    return a;
}
static double Double(double a)
{
    return a * 2;
}
static double Add(double a, double b)
{
    return a + b;
}

I use the stopwatch class to time the functions and when I ran it in debug I got the following results.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 377 ms;
TinyFunctions Time = 1322 ms;

These results make sense to me especially in debug as there is additional overhead in function calls. It is when I run it in release that I get the following results.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 173 ms;
TinyFunctions Time = 98 ms;

These results confuse me, even if the compiler was optimizing the TinyFunctions by in-lining all the function calls, how could that make it ~57% faster?

We have tried moving variable declarations around in NormalFunctions and it basically no effect on the run time.

I was hoping that someone would know what is going on and if the compiler can optimize TinyFunctions so well, why can't it apply similar optimizations to NormalFunction.

In looking around we found where someone mentioned that having the functions broken out allows the JIT to better optimize what to put in the registers, but NormalFunctions only has 4 variables so I find it hard to believe that explains the massive performance difference.

I'd be grateful for any insight someone can provide.

Update 1 As pointed out below by Kyle changing the order of operations made a massive difference in the performance of NormalFunction.

static double NormalFunction()
{
    double a = 0;
    for (int j = 0; j < s_OuterLoopCount; ++j)
    {
        for (int i = 0; i < s_InnerLoopCount; ++i)
        {
            double b = i * 2;
            a = b + 1 + a;
        }
    }
    return a;
}

Here are the results with this configuration.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 91 ms;
TinyFunctions Time = 102 ms;

This is more what I expected but still leaves the question as to why order of operations can have a ~56% performance hit.

Furthermore, I then tried it with integer operations and we are back to not making any sense.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 87 ms;
TinyFunctions Time = 52 ms;

And this doesn't change regardless of the order of operations.

244

asked Jul 18 '17 17:07

Pumices

1 Answers

I can make performance match much better by changing one line of code:

a = a + b + 1;

Change it to:

a = b + 1 + a;

Or:

a += b + 1;

Now you'll find that NormalFunction might actually be slightly faster and you can "fix" that by changing the signature of the Double method to:

int Double( int a ) { return a * 2; }

I thought of these changes because this is what was different between the two implementations. After this, their performance is very similar with TinyFunctions being a few percent slower (as expected).

The second change is easy to explain: the NormalFunction implementation actually doubles an int and then converts it to a double (with an fild opcode at the machine code level). The original Double method loads a double first and then doubles it, which I would expect to be slightly slower.

But that doesn't account for the bulk of the runtime discrepancy. That comes almost down entirely to that order change I made first. Why? I don't really have any idea. The difference in machine code looks like this:

Original                                                    Changed
01070620  push        ebp                                   01390620  push        ebp  
01070621  mov         ebp,esp                               01390621  mov         ebp,esp  
01070623  push        edi                                   01390623  push        edi  
01070624  push        esi                                   01390624  push        esi  
01070625  push        eax                                   01390625  push        eax  
01070626  fldz                                              01390626  fldz  
01070628  xor         esi,esi                               01390628  xor         esi,esi  
0107062A  mov         edi,dword ptr ds:[0FE43ACh]           0139062A  mov         edi,dword ptr ds:[12243ACh]  
01070630  test        edi,edi                               01390630  test        edi,edi  
01070632  jle         0107065A                              01390632  jle         0139065A  
01070634  xor         edx,edx                               01390634  xor         edx,edx  
01070636  mov         ecx,dword ptr ds:[0FE43B0h]           01390636  mov         ecx,dword ptr ds:[12243B0h]  
0107063C  test        ecx,ecx                               0139063C  test        ecx,ecx  
0107063E  jle         01070655                              0139063E  jle         01390655  
01070640  mov         eax,edx                               01390640  mov         eax,edx  
01070642  add         eax,eax                               01390642  add         eax,eax  
01070644  mov         dword ptr [ebp-0Ch],eax               01390644  mov         dword ptr [ebp-0Ch],eax  
01070647  fild        dword ptr [ebp-0Ch]                   01390647  fild        dword ptr [ebp-0Ch]  
0107064A  faddp       st(1),st                              0139064A  fld1  
0107064C  fld1                                              0139064C  faddp       st(1),st  
0107064E  faddp       st(1),st                              0139064E  faddp       st(1),st  
01070650  inc         edx                                   01390650  inc         edx  
01070651  cmp         edx,ecx                               01390651  cmp         edx,ecx  
01070653  jl          01070640                              01390653  jl          01390640  
01070655  inc         esi                                   01390655  inc         esi  
01070656  cmp         esi,edi                               01390656  cmp         esi,edi  
01070658  jl          01070634                              01390658  jl          01390634  
0107065A  pop         ecx                                   0139065A  pop         ecx  
0107065B  pop         esi                                   0139065B  pop         esi  
0107065C  pop         edi                                   0139065C  pop         edi  
0107065D  pop         ebp                                   0139065D  pop         ebp  
0107065E  ret                                               0139065E  ret

Which is opcode-for-opcode identical except for the order of the floating point operations. That makes a huge performance difference but I don't know enough about x86 floating point operations to know why exactly.

Update:

With the new integer version we see something else curious. In this case it seems the JIT is trying to be clever and apply an optimization because it turns this:

int b = 2 * i;
a = a + b + 1;

Into something like:

mov esi, eax              ; b = i
add esi, esi              ; b += b
lea ecx, [ecx + esi + 1]  ; a = a + b + 1

Where a is stored in the ecx register, i in eax, and b in esi.

Whereas the TinyFunctions version gets turned into something like:

mov         eax, edx  
add         eax, eax  
inc         eax  
add         ecx, eax

Where i is in edx, b is in eax, and a is in ecx this time around.

I suppose for our CPU architecture this LEA "trick" (explained here) ends up being slower than just using the ALU proper. It is still possible to change the code to get the performance between the two to line up:

int b = 2 * i + 1;
a += b;

This ends up forcing the NormalFunction approach to end up getting turned into mov, add, inc, add as it appears in the TinyFunctions approach.

answered Oct 06 '22 00:10

Kyle

Related questions
                            
                                WCF client-side error-handling
                            
                                PushStreamContent in asp.net 5 / mvc 6 is not working
                            
                                Create a Visual Studio Project Template that pulls NuGet references from online feed
                            
                                CGPDFDocument unable to read pdf
                            
                                Avoid repeating the defaults of an interface
                            
                                Check if a type belongs to a namespace without hardcoded strings
                            
                                Windows Application - handle taskkill
                            
                                Dependency injection with constructor parameter in .net core
                            
                                How to create .NET Platform Standard project
                            
                                Supress/Disable/Solve Visual Studio Build Warning
                            
                                Generic function declaration in C#
                            
                                how ASP.NET Core execute Linux shell command?
                            
                                C# - Make WCF Accept any Soap message prefixes
                            
                                How to remove image as attachment but show in body of email
                            
                                Can variables declared inside a for loop affect the performance of the loop?
                            
                                Web Pushnotification 'UnauthorizedRegistration' or 'Gone' or 'Unauthorized'- subscription expires
                            
                                Is there a difference between `x is int?` and `x is int` in C#?
                            
                                Polly timeout policy clarification
                            
                                Getting custom rss feed item element with syndicationitem?
                            
                                System.Uri and encoded colon (:)

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

C# Performance on Small Functions

Tags:

performance

c#

compilation

roslyn

jit

Pumices

People also ask

1 Answers

Update:

Kyle

Recent Activity

Donate For Us