First: I know what loop optimization is and how it works yet I found a case where I cannot explain the results. I created a prime number checker that calls modulo on each number from 2 to n - 1, so no algorithmical optimizations. EDIT: I know that prime numbers can be calculated more efficiently but this is just an example for loop behaviour. Then I created a normal and an optimized version: <pre class="prettyprint"><code>#include <stdlib.h> #include <stdio.h> typedef unsigned long long natural; int is_prime(natural n){ int is_not_prime = 0; for(natural i = 2; i < n; i += 1){ is_not_prime |= !!!(n % i); } if(is_not_prime){ return 0; }else{ return 1; } } //__attribute__((noinline)) int is_prime_opt(natural n){ int is_not_prime = 0; for(natural i = 2; i < n; i += 8){ is_not_prime |= !!( !(n % i) | !(n % i + 1) | !(n % i + 2) | !(n % i + 3) | !(n % i + 4) | !(n % i + 5) | !(n % i + 6) | !(n % i + 7)); } if(is_not_prime){ return 0; }else{ return 1; } } int main(int argc, char *argv[]) { if(argc != 2) return 1; natural check = atoi(argv[1]); if(is_prime(check)){ printf("%llu is prime\n", check); } return 0; } </code></pre> I compiled the code with gcc with <code>-O3</code> to force all optimizations done by the compiler. Since the count of iterations is not known at compile time I expect the compiler not to unroll the loop. I created a second version that does the same in blocks of 8 numbers. Since some inputs are not dividable by 8 the loop calculates at worst 7 items to much but this is acceptable. I checked the cycles with <code>valgrind --tool=callgrind ./prime 100000000</code> with the following outputs: unoptimized: <pre class="prettyprint"><code>==983== Callgrind, a call-graph generating cache profiler ==983== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al. ==983== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info ==983== Command: ./prime 100000000 ==983== ==983== For interactive control, run 'callgrind_control -h'. ==983== ==983== Events : Ir ==983== Collected : 1000098047 ==983== ==983== I refs: 1,000,098,047 </code></pre> optimized: <pre class="prettyprint"><code>==2307== Callgrind, a call-graph generating cache profiler ==2307== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al. ==2307== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info ==2307== Command: ./prime 100000000 ==2307== ==2307== For interactive control, run 'callgrind_control -h'. ==2307== ==2307== Events : Ir ==2307== Collected : 137598072 ==2307== ==2307== I refs: 137,598,072 </code></pre> I expected the loop to be 10-20% faster since I save 1/8th of jumps and checks. Also the branch prediction should already speed up the first version as all except the last jump go to the same direction. What is unclear to me why it is faster by over 7 times? Since I called it with 100M I would expect it to do at least 100M - 3 (w/o 0, 1, n) modulo, or and negation operations but it needs only 1.37 cycles per element for this (and afaik modulo isn't a cheap operation).

<code>!(n % i + 1)</code> seems to be odd, <code>n%i</code> will result in <code>0</code> or a positive number, adding <code>1</code> will lead to a positive number, calculating <code>!</code> on it will result in <code>0</code>. So every <code>!(n % i + XX)</code> can be optimized away. It should be <code>!(n % (i + 1))</code>.

C loop unrolling optimization performance

Tags:

c

loops

optimization

loop-unrolling

First: I know what loop optimization is and how it works yet I found a case where I cannot explain the results.

I created a prime number checker that calls modulo on each number from 2 to n - 1, so no algorithmical optimizations.

EDIT: I know that prime numbers can be calculated more efficiently but this is just an example for loop behaviour.

Then I created a normal and an optimized version:

#include <stdlib.h>
#include <stdio.h>

typedef unsigned long long natural;

int is_prime(natural n){
    int is_not_prime = 0;

    for(natural i = 2; i < n; i += 1){
        is_not_prime |= !!!(n % i);
    }

    if(is_not_prime){
        return 0;
    }else{
        return 1;
    }
}

//__attribute__((noinline))
int is_prime_opt(natural n){
    int is_not_prime = 0;

    for(natural i = 2; i < n; i += 8){
        is_not_prime |= !!(
                !(n % i) |
                !(n % i + 1) |
                !(n % i + 2) |
                !(n % i + 3) |
                !(n % i + 4) |
                !(n % i + 5) |
                !(n % i + 6) |
                !(n % i + 7));
    }

    if(is_not_prime){
        return 0;
    }else{
        return 1;
    }
}

int main(int argc, char *argv[])
{
    if(argc != 2)
        return 1;

    natural check = atoi(argv[1]);

    if(is_prime(check)){
        printf("%llu is prime\n", check);
    }

    return 0;
}

I compiled the code with gcc with -O3 to force all optimizations done by the compiler. Since the count of iterations is not known at compile time I expect the compiler not to unroll the loop. I created a second version that does the same in blocks of 8 numbers. Since some inputs are not dividable by 8 the loop calculates at worst 7 items to much but this is acceptable.

I checked the cycles with valgrind --tool=callgrind ./prime 100000000 with the following outputs:

unoptimized:

==983== Callgrind, a call-graph generating cache profiler
==983== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==983== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==983== Command: ./prime 100000000
==983== 
==983== For interactive control, run 'callgrind_control -h'.
==983== 
==983== Events    : Ir
==983== Collected : 1000098047
==983== 
==983== I   refs:      1,000,098,047

optimized:

==2307== Callgrind, a call-graph generating cache profiler
==2307== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==2307== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==2307== Command: ./prime 100000000
==2307== 
==2307== For interactive control, run 'callgrind_control -h'.
==2307== 
==2307== Events    : Ir
==2307== Collected : 137598072
==2307== 
==2307== I   refs:      137,598,072

I expected the loop to be 10-20% faster since I save 1/8th of jumps and checks. Also the branch prediction should already speed up the first version as all except the last jump go to the same direction.

What is unclear to me why it is faster by over 7 times? Since I called it with 100M I would expect it to do at least 100M - 3 (w/o 0, 1, n) modulo, or and negation operations but it needs only 1.37 cycles per element for this (and afaik modulo isn't a cheap operation).

794

asked Oct 19 '18 12:10

jklmnn

1 Answers

!(n % i + 1) seems to be odd, n%i will result in 0 or a positive number, adding 1 will lead to a positive number, calculating ! on it will result in 0. So every !(n % i + XX) can be optimized away.

It should be !(n % (i + 1)).

answered Nov 15 '22 04:11

mch

Related questions
                            
                                GLFW Mouse event lag with window drag
                            
                                How to tell gcc that the data pointed to by a pointer will always be aligned?
                            
                                Typedef with arrays in C
                            
                                How to get RGB pixel values from Linux framebuffer?
                            
                                When calling the exec*() family of functions, do the char* elements of argv all have to be unique?
                            
                                Where is the 'main' function located? [closed]
                            
                                How to prevent system hang before watchdog timer task kicks in
                            
                                Difference between a process signal mask, blocked signal set, and a blocked signal?
                            
                                sem_post, signal handlers, and undefined behavior
                            
                                Typedefine an enum type as structure
                            
                                C: Putting a function pointer in a structure where the function uses that structure as an argument
                            
                                Integer promotion example in C [duplicate]
                            
                                PHP Internals: How does TSRMLS_FETCH Work?
                            
                                Is there any atomic set operation that returns the old value in C?
                            
                                Why can't I do strcpy?
                            
                                How does spin_lock_bh() work?
                            
                                How to define a c macro with multiple statement
                            
                                Where to put things (CFLAGS or CXXFLAGS) on Makefile?
                            
                                Are there advantages of using `setp` instead of `setb`?
                            
                                using enum values in preprocessor directive

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With