Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C loop unrolling optimization performance

First: I know what loop optimization is and how it works yet I found a case where I cannot explain the results.

I created a prime number checker that calls modulo on each number from 2 to n - 1, so no algorithmical optimizations.

EDIT: I know that prime numbers can be calculated more efficiently but this is just an example for loop behaviour.

Then I created a normal and an optimized version:

#include <stdlib.h>
#include <stdio.h>

typedef unsigned long long natural;

int is_prime(natural n){
    int is_not_prime = 0;

    for(natural i = 2; i < n; i += 1){
        is_not_prime |= !!!(n % i);
    }

    if(is_not_prime){
        return 0;
    }else{
        return 1;
    }
}

//__attribute__((noinline))
int is_prime_opt(natural n){
    int is_not_prime = 0;

    for(natural i = 2; i < n; i += 8){
        is_not_prime |= !!(
                !(n % i) |
                !(n % i + 1) |
                !(n % i + 2) |
                !(n % i + 3) |
                !(n % i + 4) |
                !(n % i + 5) |
                !(n % i + 6) |
                !(n % i + 7));
    }

    if(is_not_prime){
        return 0;
    }else{
        return 1;
    }
}

int main(int argc, char *argv[])
{
    if(argc != 2)
        return 1;

    natural check = atoi(argv[1]);

    if(is_prime(check)){
        printf("%llu is prime\n", check);
    }

    return 0;
}

I compiled the code with gcc with -O3 to force all optimizations done by the compiler. Since the count of iterations is not known at compile time I expect the compiler not to unroll the loop. I created a second version that does the same in blocks of 8 numbers. Since some inputs are not dividable by 8 the loop calculates at worst 7 items to much but this is acceptable.

I checked the cycles with valgrind --tool=callgrind ./prime 100000000 with the following outputs:

unoptimized:

==983== Callgrind, a call-graph generating cache profiler
==983== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==983== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==983== Command: ./prime 100000000
==983== 
==983== For interactive control, run 'callgrind_control -h'.
==983== 
==983== Events    : Ir
==983== Collected : 1000098047
==983== 
==983== I   refs:      1,000,098,047

optimized:

==2307== Callgrind, a call-graph generating cache profiler
==2307== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==2307== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==2307== Command: ./prime 100000000
==2307== 
==2307== For interactive control, run 'callgrind_control -h'.
==2307== 
==2307== Events    : Ir
==2307== Collected : 137598072
==2307== 
==2307== I   refs:      137,598,072

I expected the loop to be 10-20% faster since I save 1/8th of jumps and checks. Also the branch prediction should already speed up the first version as all except the last jump go to the same direction.

What is unclear to me why it is faster by over 7 times? Since I called it with 100M I would expect it to do at least 100M - 3 (w/o 0, 1, n) modulo, or and negation operations but it needs only 1.37 cycles per element for this (and afaik modulo isn't a cheap operation).

like image 794
jklmnn Avatar asked Oct 19 '18 12:10

jklmnn


People also ask

How does loop unrolling improve performance?

Improved floating-point performance - loop unrolling can improve performance by providing the compiler more instructions to schedule across the unrolled iterations. This reduces the number of NOPs generated and also provides the compiler with a greater opportunity to generate parallel instructions.

How much faster is loop unrolling?

Unrolling WHILE loops In this case, unrolling is faster because the ENDWHILE (a jump to the start of the loop) will be executed 66% less often. Even better, the "tweaked" pseudocode example, that may be performed automatically by some optimizing compilers, eliminating unconditional jumps altogether.

Is loop unrolling always faster?

Unrolled loops are not always faster. They generate larger binaries. They require more instruction decoding. They use more memory and instruction cache.

What is software loop unrolling and why does it help to speedup execution?

Loop unrolling is a loop transformation technique that helps to optimize the execution time of a program. We basically remove or reduce iterations. Loop unrolling increases the program's speed by eliminating loop control instruction and loop test instructions.


1 Answers

!(n % i + 1) seems to be odd, n%i will result in 0 or a positive number, adding 1 will lead to a positive number, calculating ! on it will result in 0. So every !(n % i + XX) can be optimized away.

It should be !(n % (i + 1)).

like image 64
mch Avatar answered Nov 15 '22 04:11

mch