First: I know what loop optimization is and how it works yet I found a case where I cannot explain the results.
I created a prime number checker that calls modulo on each number from 2 to n - 1, so no algorithmical optimizations.
EDIT: I know that prime numbers can be calculated more efficiently but this is just an example for loop behaviour.
Then I created a normal and an optimized version:
#include <stdlib.h>
#include <stdio.h>
typedef unsigned long long natural;
int is_prime(natural n){
int is_not_prime = 0;
for(natural i = 2; i < n; i += 1){
is_not_prime |= !!!(n % i);
}
if(is_not_prime){
return 0;
}else{
return 1;
}
}
//__attribute__((noinline))
int is_prime_opt(natural n){
int is_not_prime = 0;
for(natural i = 2; i < n; i += 8){
is_not_prime |= !!(
!(n % i) |
!(n % i + 1) |
!(n % i + 2) |
!(n % i + 3) |
!(n % i + 4) |
!(n % i + 5) |
!(n % i + 6) |
!(n % i + 7));
}
if(is_not_prime){
return 0;
}else{
return 1;
}
}
int main(int argc, char *argv[])
{
if(argc != 2)
return 1;
natural check = atoi(argv[1]);
if(is_prime(check)){
printf("%llu is prime\n", check);
}
return 0;
}
I compiled the code with gcc with -O3
to force all optimizations done by the compiler. Since the count of iterations is not known at compile time I expect the compiler not to unroll the loop.
I created a second version that does the same in blocks of 8 numbers. Since some inputs are not dividable by 8 the loop calculates at worst 7 items to much but this is acceptable.
I checked the cycles with valgrind --tool=callgrind ./prime 100000000
with the following outputs:
unoptimized:
==983== Callgrind, a call-graph generating cache profiler
==983== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==983== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==983== Command: ./prime 100000000
==983==
==983== For interactive control, run 'callgrind_control -h'.
==983==
==983== Events : Ir
==983== Collected : 1000098047
==983==
==983== I refs: 1,000,098,047
optimized:
==2307== Callgrind, a call-graph generating cache profiler
==2307== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==2307== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==2307== Command: ./prime 100000000
==2307==
==2307== For interactive control, run 'callgrind_control -h'.
==2307==
==2307== Events : Ir
==2307== Collected : 137598072
==2307==
==2307== I refs: 137,598,072
I expected the loop to be 10-20% faster since I save 1/8th of jumps and checks. Also the branch prediction should already speed up the first version as all except the last jump go to the same direction.
What is unclear to me why it is faster by over 7 times? Since I called it with 100M I would expect it to do at least 100M - 3 (w/o 0, 1, n) modulo, or and negation operations but it needs only 1.37 cycles per element for this (and afaik modulo isn't a cheap operation).
Improved floating-point performance - loop unrolling can improve performance by providing the compiler more instructions to schedule across the unrolled iterations. This reduces the number of NOPs generated and also provides the compiler with a greater opportunity to generate parallel instructions.
Unrolling WHILE loops In this case, unrolling is faster because the ENDWHILE (a jump to the start of the loop) will be executed 66% less often. Even better, the "tweaked" pseudocode example, that may be performed automatically by some optimizing compilers, eliminating unconditional jumps altogether.
Unrolled loops are not always faster. They generate larger binaries. They require more instruction decoding. They use more memory and instruction cache.
Loop unrolling is a loop transformation technique that helps to optimize the execution time of a program. We basically remove or reduce iterations. Loop unrolling increases the program's speed by eliminating loop control instruction and loop test instructions.
!(n % i + 1)
seems to be odd, n%i
will result in 0
or a positive number, adding 1
will lead to a positive number, calculating !
on it will result in 0
. So every !(n % i + XX)
can be optimized away.
It should be !(n % (i + 1))
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With