Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C# Performance on Small Functions

One of my co-workers has been reading Clean Code by Robert C Martin and got to the section about using many small functions as opposed to fewer large functions. This led to a debate about the performance consequence of this methodology. So we wrote a quick program to test the performance and are confused by the results.

For starters here is the normal version of the function.

static double NormalFunction()
{
    double a = 0;
    for (int j = 0; j < s_OuterLoopCount; ++j)
    {
        for (int i = 0; i < s_InnerLoopCount; ++i)
        {
            double b = i * 2;
            a = a + b + 1;
        }
    }
    return a;
}

Here is the version I made that breaks the functionality into small functions.

static double TinyFunctions()
{
    double a = 0;
    for (int i = 0; i < s_OuterLoopCount; i++)
    {
        a = Loop(a);
    }
    return a;
}
static double Loop(double a)
{
    for (int i = 0; i < s_InnerLoopCount; i++)
    {
        double b = Double(i);
        a = Add(a, Add(b, 1));
    }
    return a;
}
static double Double(double a)
{
    return a * 2;
}
static double Add(double a, double b)
{
    return a + b;
}

I use the stopwatch class to time the functions and when I ran it in debug I got the following results.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 377 ms;
TinyFunctions Time = 1322 ms;

These results make sense to me especially in debug as there is additional overhead in function calls. It is when I run it in release that I get the following results.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 173 ms;
TinyFunctions Time = 98 ms;

These results confuse me, even if the compiler was optimizing the TinyFunctions by in-lining all the function calls, how could that make it ~57% faster?

We have tried moving variable declarations around in NormalFunctions and it basically no effect on the run time.

I was hoping that someone would know what is going on and if the compiler can optimize TinyFunctions so well, why can't it apply similar optimizations to NormalFunction.

In looking around we found where someone mentioned that having the functions broken out allows the JIT to better optimize what to put in the registers, but NormalFunctions only has 4 variables so I find it hard to believe that explains the massive performance difference.

I'd be grateful for any insight someone can provide.

Update 1 As pointed out below by Kyle changing the order of operations made a massive difference in the performance of NormalFunction.

static double NormalFunction()
{
    double a = 0;
    for (int j = 0; j < s_OuterLoopCount; ++j)
    {
        for (int i = 0; i < s_InnerLoopCount; ++i)
        {
            double b = i * 2;
            a = b + 1 + a;
        }
    }
    return a;
}

Here are the results with this configuration.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 91 ms;
TinyFunctions Time = 102 ms;

This is more what I expected but still leaves the question as to why order of operations can have a ~56% performance hit.

Furthermore, I then tried it with integer operations and we are back to not making any sense.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 87 ms;
TinyFunctions Time = 52 ms;

And this doesn't change regardless of the order of operations.

like image 244
Pumices Avatar asked Jul 18 '17 17:07

Pumices


People also ask

What C is used for?

C programming language is a machine-independent programming language that is mainly used to create many types of applications and operating systems such as Windows, and other complicated programs such as the Oracle database, Git, Python interpreter, and games and is considered a programming foundation in the process of ...

What is the full name of C?

In the real sense it has no meaning or full form. It was developed by Dennis Ritchie and Ken Thompson at AT&T bell Lab. First, they used to call it as B language then later they made some improvement into it and renamed it as C and its superscript as C++ which was invented by Dr.

What is C in C language?

What is C? C is a general-purpose programming language created by Dennis Ritchie at the Bell Laboratories in 1972. It is a very popular language, despite being old. C is strongly associated with UNIX, as it was developed to write the UNIX operating system.

Is C language easy?

C is a general-purpose language that most programmers learn before moving on to more complex languages. From Unix and Windows to Tic Tac Toe and Photoshop, several of the most commonly used applications today have been built on C. It is easy to learn because: A simple syntax with only 32 keywords.


1 Answers

I can make performance match much better by changing one line of code:

a = a + b + 1;

Change it to:

a = b + 1 + a;

Or:

a += b + 1;

Now you'll find that NormalFunction might actually be slightly faster and you can "fix" that by changing the signature of the Double method to:

int Double( int a ) { return a * 2; }

I thought of these changes because this is what was different between the two implementations. After this, their performance is very similar with TinyFunctions being a few percent slower (as expected).

The second change is easy to explain: the NormalFunction implementation actually doubles an int and then converts it to a double (with an fild opcode at the machine code level). The original Double method loads a double first and then doubles it, which I would expect to be slightly slower.

But that doesn't account for the bulk of the runtime discrepancy. That comes almost down entirely to that order change I made first. Why? I don't really have any idea. The difference in machine code looks like this:

Original                                                    Changed
01070620  push        ebp                                   01390620  push        ebp  
01070621  mov         ebp,esp                               01390621  mov         ebp,esp  
01070623  push        edi                                   01390623  push        edi  
01070624  push        esi                                   01390624  push        esi  
01070625  push        eax                                   01390625  push        eax  
01070626  fldz                                              01390626  fldz  
01070628  xor         esi,esi                               01390628  xor         esi,esi  
0107062A  mov         edi,dword ptr ds:[0FE43ACh]           0139062A  mov         edi,dword ptr ds:[12243ACh]  
01070630  test        edi,edi                               01390630  test        edi,edi  
01070632  jle         0107065A                              01390632  jle         0139065A  
01070634  xor         edx,edx                               01390634  xor         edx,edx  
01070636  mov         ecx,dword ptr ds:[0FE43B0h]           01390636  mov         ecx,dword ptr ds:[12243B0h]  
0107063C  test        ecx,ecx                               0139063C  test        ecx,ecx  
0107063E  jle         01070655                              0139063E  jle         01390655  
01070640  mov         eax,edx                               01390640  mov         eax,edx  
01070642  add         eax,eax                               01390642  add         eax,eax  
01070644  mov         dword ptr [ebp-0Ch],eax               01390644  mov         dword ptr [ebp-0Ch],eax  
01070647  fild        dword ptr [ebp-0Ch]                   01390647  fild        dword ptr [ebp-0Ch]  
0107064A  faddp       st(1),st                              0139064A  fld1  
0107064C  fld1                                              0139064C  faddp       st(1),st  
0107064E  faddp       st(1),st                              0139064E  faddp       st(1),st  
01070650  inc         edx                                   01390650  inc         edx  
01070651  cmp         edx,ecx                               01390651  cmp         edx,ecx  
01070653  jl          01070640                              01390653  jl          01390640  
01070655  inc         esi                                   01390655  inc         esi  
01070656  cmp         esi,edi                               01390656  cmp         esi,edi  
01070658  jl          01070634                              01390658  jl          01390634  
0107065A  pop         ecx                                   0139065A  pop         ecx  
0107065B  pop         esi                                   0139065B  pop         esi  
0107065C  pop         edi                                   0139065C  pop         edi  
0107065D  pop         ebp                                   0139065D  pop         ebp  
0107065E  ret                                               0139065E  ret  

Which is opcode-for-opcode identical except for the order of the floating point operations. That makes a huge performance difference but I don't know enough about x86 floating point operations to know why exactly.

Update:

With the new integer version we see something else curious. In this case it seems the JIT is trying to be clever and apply an optimization because it turns this:

int b = 2 * i;
a = a + b + 1;

Into something like:

mov esi, eax              ; b = i
add esi, esi              ; b += b
lea ecx, [ecx + esi + 1]  ; a = a + b + 1

Where a is stored in the ecx register, i in eax, and b in esi.

Whereas the TinyFunctions version gets turned into something like:

mov         eax, edx  
add         eax, eax  
inc         eax  
add         ecx, eax  

Where i is in edx, b is in eax, and a is in ecx this time around.

I suppose for our CPU architecture this LEA "trick" (explained here) ends up being slower than just using the ALU proper. It is still possible to change the code to get the performance between the two to line up:

int b = 2 * i + 1;
a += b;

This ends up forcing the NormalFunction approach to end up getting turned into mov, add, inc, add as it appears in the TinyFunctions approach.

like image 73
Kyle Avatar answered Oct 06 '22 00:10

Kyle