Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C - have a simple loop that does arithmetic calculation; profiler shows this is a bottleneck. How to speed it up?

My first post here. A great site and resource.

I did search a bit and looked at the questions with similar titles, but couldn't find something specifically about this.

I'm trying to remove any redundancy and bloat from a C astronomical calculation library that my C++ program uses. I ran a simple profiler (VerySleepy).

Here is the code that the profiler showed as using the most time (aside from C library functions sprintf, etc.):

double swi_echeb(const double x, const double* const coef, const int ncf)
{
    int j = ncf - 1;
    double x2, br, brp2, brpp;
    x2 = x * 2.;
    br = 0.;
    brp2 = 0.;  /* dummy assign to silence gcc warning */
    brpp = 0.;

    for (; j >= 0; --j) {                 // <-- 0.39s
        brp2 = brpp;                      // <-- 0.01s
        brpp = br;                        // <-- 0.32s
        br = x2 * brpp - brp2 + coef[j];  // <-- 3.49s ***
    }                                     // <-- 0.14s

    return (br - brp2) * .5;              // <-- 0.06s
}                                         // <-- 0.05s

This particular function is deeply embedded within others and the main "kick-off" function that my program calls is called thousands of times.

You can see the standout statement with 3.49s as much higher than all the other statement times. I know there are ways to speed C arithmetic with using multiplication over division when possible. But I don't know much more than that.

Like:

  1. Would it be better to split this statement up into smaller pieces?:

    br = x2 * brpp;

    br -= brp2;

    br += coef[j];

Any other ideas or critiques. I did not write this code, though I did add the const to the function parameters as I love const-correctness.

I've never tried using registers or other fancy tricks to speed things up before. Anyone think something like that can work here?

I know people will say, "Try it!" So I will, and will update what I get if it helps anyone with similar arithmetic questions.

EDIT: Posting Results I've Tested From Suggestions

In order from fastest to slowest, here are what I've found so far. Profiler is VerySleepy. Compiler is Visual Studio 2008 Pro Ed. Compile options for both library and my application are:

Debug, C7 format, /O2 /Ob2 /Oi /Ot /Oy /GT /GL /GF /FD /MTd /GS- /Gy /fp:fast /FAs

The following is Andrew's suggestion about doing "4 iterations per loop". It was the fastest so far.

TOTAL TIME spent in function (times from the other statements in the function are not shown here) = 2.08 seconds

for (; index >= 3; index -= 4) {                    // 0.02s
    brp2    = brpp;
    brpp    = br;                                   // 0.02s
    br      = x2 * brpp - brp2 + coef[index];       // 0.25s
    brp2    = brpp;
    brpp    = br;                                   // 0.13s
    br      = x2 * brpp - brp2 + coef[index - 1];   // 0.33s
    brp2    = brpp;
    brpp    = br;                                   // 0.13s
    br      = x2 * brpp - brp2 + coef[index - 2];   // 0.34s
    brp2    = brpp;
    brpp    = br;                                   // 0.14s
    br      = x2 * brpp - brp2 + coef[index - 3];   // 0.42s
}

for (; index >= 0; --index) {                 // 0.03s
    brp2    = brpp;                           // 0.03s
    brpp    = br;
    br      = x2 * brpp - brp2 + coef[index]; // 0.11s
}

The next fastest was the original unaltered code, with a total time of 2.39 seconds inside the function, again including the statements outside the loop. Note that this is less than my original post. My original post was unoptimized code, but since everyone suggested it, all of my tests were subsequently as optimized as I could get in VS08:

for (j = ncf - 1; j >= 0; j--) {      // 0.02s
    brp2 = brpp;                      // 0.03s
    brpp = br;                        // 0.07s
    br = x2 * brpp - brp2 + coef[j];  // 2.14s
}

After this original code, the next fastest was Drew's idea of setting the pointer in advance and using that. Total time spent inside function was 2.49 seconds, including times from statements outside loop:

for (; index >= coef; --index) {         // 0.01s
    brp2    = brpp;
    brpp    = br;                        // 0.06s
    br      = x2 * brpp - brp2 + *index; // 2.24s
}

I also tried a mix of both Andrew's loop unrolling and Drew's pointer usage, but that took 2.39 seconds, the same as the unaltered code.

Based on the results, the loop-unrolling is the way to go so far for my usage.

like image 612
nedshares Avatar asked Dec 31 '11 19:12

nedshares


2 Answers

This appears to be a cache issue, not an arithmetic one.

for (; j >= 0; --j) {
    ...
    ... coef[j];
}

You're accessing an array here, and you are decrementing an index to do so. This action can really disrupt the cache-friendly locality inherent in a simple loop.

Is it possible to count forward? Ie,

for (int i = 0; i <= j; i++) {
    ...
    ... coef[i];
}

Would your calculation be valid?

like image 108
chrisaycock Avatar answered Oct 20 '22 22:10

chrisaycock


First thing I would try would be to iterate in steps of 4, ie: j+=4 (or in your case, j -=4) and semi-unroll the loop. The reason for this is it will help the compiler to make SSE optimisations and to batch memory access from main memory to cache. Just be aware that you will have to cater for the last few elements in case the loop count is not divisible by 4. For example:

// Disclaimer: I have not tested this code!
for (; j >= 3; j -= 4) {              
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef[j]; 
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef[j-1]; 
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef[j-2]; 
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef[j-3]; 
}                          
// if (j % 4) != 0 before the loop operation, 
// handle 1, 2 or 3 remaining elements here

Second thing I would try would be to preload coeff[j] into a register immediate prior to the calculation. The reason for this is floating point calculations are pipelined, meaning that a memory access in the wrong place can have adverse effects on performance. The calculation itself can be very fast but might take 14 instructions just to queue up the data from cache into the FPU. Add to that an access from main memory it can get even worse. For instance, try this (could also be tried with and without the -=4 unrolling)

// Disclaimer: I have not tested this code!
register double coef1, coef2, coef3, ceof4;
for (; j >= 3; j -= 4) {           
    coef1 = coef[j];    // Preloads the 4 sequential coeffs from 
    coef2 = coef[j-1];  // main memory to cache (if available)
    coef3 = coef[j-2];  
    coef4 = coef[j-3];  
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef1; 
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef2; 
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef3; 
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef4; 
} 

In this case the variables double x2, br, brp2, brpp, coef1, coef2, coef3, coef4 should be registers if at all possible.

Finally, using the above, can you apply SSE/SSE2 optimisation to it? Make sure this is enabled in the GCC compiler (I'm used to VS so the equivalent would be Release mode on, debug symbols off, optimization on, SSE2 on) and benchmark your code without the debugger attached. This alone can have a dramatic affect on performance.

Let us know the results. Performance tuning is trial and error!

Best of luck,

like image 43
Dr. Andrew Burnett-Thompson Avatar answered Oct 20 '22 21:10

Dr. Andrew Burnett-Thompson