
What GCC Optimization Is Occurring In This For Loop?

Tags:

c++

gcc

Using gcc 4.6 with -O3, I have timed the following four programs using the simple time command:

#include <iostream>
int main(int argc, char* argv[])
{
  double val = 1.0; 
  unsigned int numIterations = 1e7; 
  for(unsigned int ii = 0;ii < numIterations;++ii) {
    val *= 0.999;
  }
  std::cout<<val<<std::endl;
}

Case 1 runs in 0.09 seconds

#include <iostream>
int main(int argc, char* argv[])
{
  double val = 1.0;
  unsigned int numIterations = 1e8;
  for(unsigned int ii = 0;ii < numIterations;++ii) {
    val *= 0.999;
  }
  std::cout<<val<<std::endl;
}

Case 2 runs in 17.6 seconds

int main(int argc, char* argv[])
{
  double val = 1.0;
  unsigned int numIterations = 1e8;
  for(unsigned int ii = 0;ii < numIterations;++ii) {
    val *= 0.999;
  }
}

Case 3 runs in 0.8 seconds

#include <iostream>
int main(int argc, char* argv[])
{
  double val = 1.0;
  unsigned int numIterations = 1e8;
  for(unsigned int ii = 0;ii < numIterations;++ii) {
    val *= 0.999999;
  }
  std::cout<<val<<std::endl;
}

Case 4 runs in 0.8 seconds

My question is: why is the second case so much slower than all the other cases? Case 3 shows that removing the cout brings the runtime back in line with what is expected, and Case 4 shows that changing the multiplier also greatly reduces the runtime. What optimization or optimizations are not occurring in case 2, and why?

Update:

When I originally ran these tests there was no separate numIterations variable; the value was hard-coded in the for loop. In general, hard-coding this value made things run slower than in the cases given here. This is especially true for Case 3, which ran almost instantly with the numIterations variable as shown above, indicating James McNellis is correct about the entire loop being optimized out. I'm not sure why hard-coding the 1e8 into the for loop prevents the removal of the loop in Case 3 or makes things slower in the other cases; however, the basic premise of Case 2 being significantly slower is even more true.

Diffing the assembly output for the cases above gives:

Case 2 and Case 1:
movl $100000000, 16(%esp)


movl $10000000, 16(%esp)

Case 2 and Case 4:
.long -652835029
.long 1072691150


.long -417264663
.long 1072693245

asked Jul 04 '11 by user334066



2 Answers

René Richter was on the right track regarding underflow. The smallest positive normalized number is about 2.2e-308. With f(n)=0.999**n, this limit is reached after about 708,148 iterations. The remaining iterations are stuck with denormalized computations.

This explains why 100 million iterations take slightly more than 10 times the time needed to perform 10 million. The first 700,000 are done using the floating point hardware. Once you hit denormalized numbers, the floating point hardware punts; the multiplication is done in software.

Note that this would not be the case if the repeated computation properly calculated 0.999**N. Eventually the product would reach zero, and from that point on the multiplications would once again be done with the floating point hardware. That is not what happens because 0.999 * (smallest denormalized number) is the smallest denormalized number. The continued product eventually bottoms out.
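
As a sanity check (my addition, not part of the original answer), the iteration count at which the product leaves the normalized range can be computed and verified directly, assuming IEEE-754 doubles:

#include <cmath>
#include <iostream>
#include <limits>

int main()
{
  // Smallest positive normalized double, roughly 2.2e-308.
  const double tiny = std::numeric_limits<double>::min();
  std::cout << "smallest normalized double: " << tiny << std::endl;

  // Solve 0.999^n = tiny for n: n = log(tiny) / log(0.999), roughly 7.1e5.
  std::cout << "normalized range exhausted after ~"
            << std::log(tiny) / std::log(0.999) << " iterations" << std::endl;

  // Confirm by watching the classification of the running product.
  double val = 1.0;
  for (unsigned int ii = 0; ii < 1000000; ++ii) {
    val *= 0.999;
    if (std::fpclassify(val) == FP_SUBNORMAL) {
      std::cout << "first denormalized product at iteration " << ii + 1 << std::endl;
      break;
    }
  }
}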

What we can do is change the multiplier. A multiplier of 0.999999 will keep the continued product within the realm of normalized numbers for roughly 708 million iterations. Taking advantage of this,

Case A  : #iterations = 1e7, multiplier=0.999, time=0.573692 sec
Case B  : #iterations = 1e8, multiplier=0.999, time=6.10548 sec
Case C  : #iterations = 1e7, multiplier=0.999999, time=0.018867 sec
Case D  : #iterations = 1e8, multiplier=0.999999, time=0.189375 sec

Here you can easily see how much faster the floating point hardware is than the software emulation.
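
The answer doesn't show its timing harness; a minimal sketch of the kind of measurement involved might look like the following (the timeLoop helper and the iteration counts are my own choices), comparing multiplications on normalized values against multiplications that start at the edge of the denormalized range:

#include <chrono>
#include <iostream>
#include <limits>

// Multiply 'start' by 0.999 'n' times and report the elapsed time in seconds.
// Returning the product through 'out' keeps the loop from being optimized away.
static double timeLoop(double start, unsigned int n, double& out)
{
  auto t0 = std::chrono::steady_clock::now();
  double val = start;
  for (unsigned int ii = 0; ii < n; ++ii)
    val *= 0.999;
  auto t1 = std::chrono::steady_clock::now();
  out = val;
  return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
  const unsigned int n = 100000;
  double a, b;
  // Starting from 1.0, 1e5 iterations never leave the normalized range...
  double normal = timeLoop(1.0, n, a);
  // ...but starting from the smallest normalized double, every product is denormalized.
  double denorm = timeLoop(std::numeric_limits<double>::min(), n, b);
  std::cout << "normalized:   " << normal << " s (result " << a << ")\n"
            << "denormalized: " << denorm << " s (result " << b << ")\n";
}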

answered by David Hammen

Compile with option -S and look at the generated assembler output (files named *.s).
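
For example (the file names here are placeholders):

g++ -O3 -S case2.cpp -o case2.s

This writes the generated assembler for case2.cpp to case2.s instead of producing an object file, which is the sort of output the diffs in the question were taken from.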

Edit:

In Case 3 the loop is removed entirely since the result is not used.

For cases 1, 2 and 4 let's do some math: the base-10 logarithm of the result of case 1 is 1e7 * log10(0.999) = -4345 (roughly). For case 2 we get 1e8 * log10(0.999) = -43451. For case 4 it is 1e8 * log10(0.999999) ≈ -43. The result itself is pow(10, logarithm).

The floating point unit uses 80-bit long doubles internally on x86/x64 CPUs. When the number becomes smaller than about 1.9E-4951 we get a floating point underflow, as @James Kanze pointed out. This happens only in case 2. I don't know why this takes longer than a normalized result; maybe someone else can explain.
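
To double-check those orders of magnitude (my addition, assuming a typical x86 Linux toolchain where long double is the 80-bit extended format):

#include <cmath>
#include <iostream>
#include <limits>

int main()
{
  // Base-10 log of the final product in each case: N * log10(multiplier).
  std::cout << "case 1: " << 1e7 * std::log10(0.999)    << "\n"   // about -4345
            << "case 2: " << 1e8 * std::log10(0.999)    << "\n"   // about -43451
            << "case 4: " << 1e8 * std::log10(0.999999) << "\n";  // about -43

  // Limits of double versus 80-bit long double.
  std::cout << "smallest normalized double:        "
            << std::numeric_limits<double>::min() << "\n"              // ~2.2e-308
            << "smallest normalized long double:   "
            << std::numeric_limits<long double>::min() << "\n"         // ~3.4e-4932
            << "smallest denormalized long double: "
            << std::numeric_limits<long double>::denorm_min() << "\n"; // ~3.6e-4951
}

Only case 2's result falls below even the extended-precision limits, which is consistent with the underflow appearing there alone.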

answered by René Richter