
Why does GCC generate 15-20% faster code if I optimize for size instead of speed?

I first noticed in 2009 that GCC (at least on my projects and on my machines) has the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.

I have managed to create (rather silly) code that shows this surprising behavior and is sufficiently small to be posted here.

    const int LOOP_BOUND = 200000000;

    __attribute__((noinline))
    static int add(const int& x, const int& y) {
        return x + y;
    }

    __attribute__((noinline))
    static int work(int xval, int yval) {
        int sum(0);
        for (int i=0; i<LOOP_BOUND; ++i) {
            int x(xval+sum);
            int y(yval+sum);
            int z = add(x, y);
            sum += z;
        }
        return sum;
    }

    int main(int , char* argv[]) {
        // pass the first characters of the arguments: runtime values
        // that the compiler cannot constant-fold away
        int result = work(*argv[1], *argv[2]);
        return result;
    }

If I compile it with -Os, it takes 0.38 s to execute this program, and 0.44 s if it is compiled with -O2 or -O3. These times are obtained consistently and with practically no noise (gcc 4.7.2, x86_64 GNU/Linux, Intel Core i5-3320M).
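(As a sanity check, the timing can also be taken inside the process. This is a minimal sketch, not part of my original measurement setup: it assumes it replaces main() in the listing above, in the same translation unit, and is compiled with -std=c++11. The best-of-5 policy is an arbitrary choice.)

    #include <algorithm>
    #include <chrono>
    #include <cstdio>

    int main(int, char* argv[]) {
        using clock = std::chrono::steady_clock;
        double best = 1e30;
        for (int run = 0; run < 5; ++run) {            // repeat, keep the minimum
            auto t0 = clock::now();
            volatile int r = work(*argv[1], *argv[2]); // volatile keeps the call alive
            (void)r;
            std::chrono::duration<double> dt = clock::now() - t0;
            best = std::min(best, dt.count());
        }
        std::printf("best of 5: %.3f s\n", best);
        return 0;
    }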

(Update: I have moved all assembly code to GitHub: it made the post bloated and apparently adds very little value to the question, as the -fno-align-* flags have the same effect.)

Here is the generated assembly with -Os and -O2.

Unfortunately, my understanding of assembly is very limited, so I have no idea whether what I did next was correct: I grabbed the assembly for -O2 and merged all its differences into the assembly for -Os, except the .p2align lines (result here). This code still runs in 0.38s, and the only difference is the .p2align stuff.

If I guess correctly, these are paddings for code alignment: a directive like .p2align 4,,15 pads with NOPs up to the next 2^4 = 16-byte boundary, but only if at most 15 bytes of padding are needed, so that function entries and branch targets start at aligned addresses. According to Why does GCC pad functions with NOPs? this is done in the hope that the code will run faster, but apparently the optimization backfired in my case.

Is it the padding that is the culprit in this case? Why and how?

The noise this padding introduces pretty much makes timing micro-optimizations impossible.

How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source code?
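(The best I could come up with so far is a sketch, under the assumption that pinning the hot functions to a fixed, generous boundary keeps unrelated edits from shifting them across cache-line or fetch-window boundaries. GCC accepts the aligned attribute on functions; 64 is an arbitrary choice matching a typical x86 cache-line size.)

    // Hedged sketch: force a stable 64-byte alignment on the functions under
    // test so that accidental code-layout changes elsewhere cannot move them.
    __attribute__((noinline, aligned(64)))
    static int add(const int& x, const int& y) {
        return x + y;
    }
    // apply the same attribute to work() and any other function being timed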


UPDATE:

Following Pascal Cuoq's answer I tinkered a little bit with the alignments. By passing -O2 -fno-align-functions -fno-align-loops to gcc, all .p2align are gone from the assembly and the generated executable runs in 0.38s. According to the gcc documentation:

-Os enables all -O2 optimizations [but] -Os disables the following optimization flags:

  -falign-functions  -falign-jumps  -falign-loops   -falign-labels  -freorder-blocks  -freorder-blocks-and-partition   -fprefetch-loop-arrays 

So, it pretty much seems like a (mis)alignment issue.

I am still skeptical about -march=native as suggested in Marat Dukhan's answer. I am not convinced that it isn't just interfering with this (mis)alignment issue; it has absolutely no effect on my machine. (Nevertheless, I upvoted his answer.)


UPDATE 2:

We can take -Os out of the picture. The following times are obtained by compiling with

  • -O2 -fno-omit-frame-pointer 0.37s

  • -O2 -fno-align-functions -fno-align-loops 0.37s

  • -S -O2 then manually moving the assembly of add() after work() 0.37s (a source-level variant is sketched after this list)

  • -O2 0.44s
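Regarding the third variant: instead of editing the assembly by hand, one can try to influence the emission order from the source. This is a hedged sketch, not what I actually did, and whether a given gcc version really emits functions in source order has to be verified by inspecting the -S output:

    // Forward-declare add() and define it after work(), so that a compiler
    // emitting functions in source order places add() behind work(),
    // similar to the hand-edited assembly.
    static int add(const int& x, const int& y);   // declaration first

    __attribute__((noinline))
    static int work(int xval, int yval) {
        int sum(0);
        for (int i=0; i<LOOP_BOUND; ++i) {
            int x(xval+sum);
            int y(yval+sum);
            int z = add(x, y);
            sum += z;
        }
        return sum;
    }

    __attribute__((noinline))
    static int add(const int& x, const int& y) {  // definition now follows work()
        return x + y;
    }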

It looks to me like the distance of add() from the call site matters a lot. I have tried perf, but the output of perf stat and perf report makes very little sense to me. However, I could only get one consistent result out of it:

-O2:

 602,312,864 stalled-cycles-frontend   #    0.00% frontend cycles idle
       3,318 cache-misses
 0.432703993 seconds time elapsed
[...]
 81.23%  a.out  a.out              [.] work(int, int)
 18.50%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
[...]
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
       ¦       return x + y;
100.00 ¦     lea    (%rdi,%rsi,1),%eax
       ¦   }
       ¦   ? retq
[...]
       ¦            int z = add(x, y);
  1.93 ¦    ? callq  add(int const&, int const&) [clone .isra.0]
       ¦            sum += z;
 79.79 ¦      add    %eax,%ebx

For -fno-align-*:

 604,072,552 stalled-cycles-frontend   #    0.00% frontend cycles idle
       9,508 cache-misses
 0.375681928 seconds time elapsed
[...]
 82.58%  a.out  a.out              [.] work(int, int)
 16.83%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
[...]
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
       ¦       return x + y;
 51.59 ¦     lea    (%rdi,%rsi,1),%eax
       ¦   }
[...]
       ¦    __attribute__((noinline))
       ¦    static int work(int xval, int yval) {
       ¦        int sum(0);
       ¦        for (int i=0; i<LOOP_BOUND; ++i) {
       ¦            int x(xval+sum);
  8.20 ¦      lea    0x0(%r13,%rbx,1),%edi
       ¦            int y(yval+sum);
       ¦            int z = add(x, y);
 35.34 ¦    ? callq  add(int const&, int const&) [clone .isra.0]
       ¦            sum += z;
 39.48 ¦      add    %eax,%ebx
       ¦    }

For -fno-omit-frame-pointer:

 404,625,639 stalled-cycles-frontend   #    0.00% frontend cycles idle
      10,514 cache-misses
 0.375445137 seconds time elapsed
[...]
 75.35%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 24.46%  a.out  a.out              [.] work(int, int)
[...]
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
 18.67 ¦     push   %rbp
       ¦       return x + y;
 18.49 ¦     lea    (%rdi,%rsi,1),%eax
       ¦   const int LOOP_BOUND = 200000000;
       ¦
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
       ¦     mov    %rsp,%rbp
       ¦       return x + y;
       ¦   }
 12.71 ¦     pop    %rbp
       ¦   ? retq
[...]
       ¦            int z = add(x, y);
       ¦    ? callq  add(int const&, int const&) [clone .isra.0]
       ¦            sum += z;
 29.83 ¦      add    %eax,%ebx

It looks like we are stalling on the call to add() in the slow case. stalled-cycles-frontend counts cycles in which the CPU's fetch/decode stages fail to deliver work, which is consistent with an instruction-fetch (i.e. code-alignment) problem rather than a data problem.

I have examined everything that perf -e can spit out on my machine, not just the stats given above.

For the same executable, the stalled-cycles-frontend shows linear correlation with the execution time; I did not notice anything else that would correlate so clearly. (Comparing stalled-cycles-frontend for different executables doesn't make sense to me.)

I included the cache misses because they came up in the first comment. I examined all the cache misses that can be measured on my machine by perf, not just the ones given above. The cache misses are very noisy and show little to no correlation with the execution times.

Asked Oct 19 '13 by Ali



1 Answer

By default, compilers optimize for an "average" processor. Since different processors favor different instruction sequences, compiler optimizations enabled by -O2 might benefit the average processor but decrease performance on your particular one (and the same applies to -Os). If you try the same example on different processors, you will find that some of them benefit from -O2 while others are more favorable to -Os optimizations.

Here are the results for time ./test 0 0 on several processors (user time reported):

Processor (System-on-Chip)             Compiler   Time (-O2)  Time (-Os)  Fastest
AMD Opteron 8350                       gcc-4.8.1    0.704s      0.896s      -O2
AMD FX-6300                            gcc-4.8.1    0.392s      0.340s      -Os
AMD E2-1800                            gcc-4.7.2    0.740s      0.832s      -O2
Intel Xeon E5405                       gcc-4.8.1    0.603s      0.804s      -O2
Intel Xeon E5-2603                     gcc-4.4.7    1.121s      1.122s       -
Intel Core i3-3217U                    gcc-4.6.4    0.709s      0.709s       -
Intel Core i3-3217U                    gcc-4.7.3    0.708s      0.822s      -O2
Intel Core i3-3217U                    gcc-4.8.1    0.708s      0.944s      -O2
Intel Core i7-4770K                    gcc-4.8.1    0.296s      0.288s      -Os
Intel Atom 330                         gcc-4.8.1    2.003s      2.007s      -O2
ARM 1176JZF-S (Broadcom BCM2835)       gcc-4.6.3    3.470s      3.480s      -O2
ARM Cortex-A8 (TI OMAP DM3730)         gcc-4.6.3    2.727s      2.727s       -
ARM Cortex-A9 (TI OMAP 4460)           gcc-4.6.3    1.648s      1.648s       -
ARM Cortex-A9 (Samsung Exynos 4412)    gcc-4.6.3    1.250s      1.250s       -
ARM Cortex-A15 (Samsung Exynos 5250)   gcc-4.7.2    0.700s      0.700s       -
Qualcomm Snapdragon APQ8060A           gcc-4.8       1.53s       1.52s      -Os

In some cases you can alleviate the effect of disadvantageous optimizations by asking gcc to optimize for your particular processor (using options -mtune=native or -march=native):

Processor            Compiler   Time (-O2 -mtune=native)  Time (-Os -mtune=native)
AMD FX-6300          gcc-4.8.1         0.340s                   0.340s
AMD E2-1800          gcc-4.7.2         0.740s                   0.832s
Intel Xeon E5405     gcc-4.8.1         0.603s                   0.803s
Intel Core i7-4770K  gcc-4.8.1         0.296s                   0.288s
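(A related source-level option, not something benchmarked here: instead of rebuilding with -mtune=native on every machine, a single binary can choose between differently tuned code paths at run time. A minimal sketch using gcc's __builtin_cpu_supports builtin, available since gcc 4.8; the "avx" test and the two variants are only illustrative placeholders.)

    #include <cstdio>

    // Two hypothetical variants of the hot routine, tuned for different CPUs.
    static int work_generic(int x, int y) { return x + y; }
    static int work_newer(int x, int y)   { return x + y; }

    int main() {
        __builtin_cpu_init();  // initialize gcc's CPU detection
        // Select the variant based on the CPU the binary actually runs on.
        int (*work)(int, int) =
            __builtin_cpu_supports("avx") ? work_newer : work_generic;
        std::printf("%d\n", work(1, 2));
        return 0;
    }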

Update: on Ivy Bridge-based Core i3, three versions of gcc (4.6.4, 4.7.3, and 4.8.1) produce binaries with significantly different performance, but the assembly code has only subtle variations. So far, I have no explanation for this fact.

Assembly from gcc-4.6.4 -Os (executes in 0.709 secs):

00000000004004d2 <_ZL3addRKiS0_.isra.0>:
  4004d2:       8d 04 37                lea    eax,[rdi+rsi*1]
  4004d5:       c3                      ret

00000000004004d6 <_ZL4workii>:
  4004d6:       41 55                   push   r13
  4004d8:       41 89 fd                mov    r13d,edi
  4004db:       41 54                   push   r12
  4004dd:       41 89 f4                mov    r12d,esi
  4004e0:       55                      push   rbp
  4004e1:       bd 00 c2 eb 0b          mov    ebp,0xbebc200
  4004e6:       53                      push   rbx
  4004e7:       31 db                   xor    ebx,ebx
  4004e9:       41 8d 34 1c             lea    esi,[r12+rbx*1]
  4004ed:       41 8d 7c 1d 00          lea    edi,[r13+rbx*1+0x0]
  4004f2:       e8 db ff ff ff          call   4004d2 <_ZL3addRKiS0_.isra.0>
  4004f7:       01 c3                   add    ebx,eax
  4004f9:       ff cd                   dec    ebp
  4004fb:       75 ec                   jne    4004e9 <_ZL4workii+0x13>
  4004fd:       89 d8                   mov    eax,ebx
  4004ff:       5b                      pop    rbx
  400500:       5d                      pop    rbp
  400501:       41 5c                   pop    r12
  400503:       41 5d                   pop    r13
  400505:       c3                      ret

Assembly from gcc-4.7.3 -Os (executes in 0.822 secs):

00000000004004fa <_ZL3addRKiS0_.isra.0>:
  4004fa:       8d 04 37                lea    eax,[rdi+rsi*1]
  4004fd:       c3                      ret

00000000004004fe <_ZL4workii>:
  4004fe:       41 55                   push   r13
  400500:       41 89 f5                mov    r13d,esi
  400503:       41 54                   push   r12
  400505:       41 89 fc                mov    r12d,edi
  400508:       55                      push   rbp
  400509:       bd 00 c2 eb 0b          mov    ebp,0xbebc200
  40050e:       53                      push   rbx
  40050f:       31 db                   xor    ebx,ebx
  400511:       41 8d 74 1d 00          lea    esi,[r13+rbx*1+0x0]
  400516:       41 8d 3c 1c             lea    edi,[r12+rbx*1]
  40051a:       e8 db ff ff ff          call   4004fa <_ZL3addRKiS0_.isra.0>
  40051f:       01 c3                   add    ebx,eax
  400521:       ff cd                   dec    ebp
  400523:       75 ec                   jne    400511 <_ZL4workii+0x13>
  400525:       89 d8                   mov    eax,ebx
  400527:       5b                      pop    rbx
  400528:       5d                      pop    rbp
  400529:       41 5c                   pop    r12
  40052b:       41 5d                   pop    r13
  40052d:       c3                      ret

Assembly from gcc-4.8.1 -Os (executes in 0.994 secs):

00000000004004fd <_ZL3addRKiS0_.isra.0>:
  4004fd:       8d 04 37                lea    eax,[rdi+rsi*1]
  400500:       c3                      ret

0000000000400501 <_ZL4workii>:
  400501:       41 55                   push   r13
  400503:       41 89 f5                mov    r13d,esi
  400506:       41 54                   push   r12
  400508:       41 89 fc                mov    r12d,edi
  40050b:       55                      push   rbp
  40050c:       bd 00 c2 eb 0b          mov    ebp,0xbebc200
  400511:       53                      push   rbx
  400512:       31 db                   xor    ebx,ebx
  400514:       41 8d 74 1d 00          lea    esi,[r13+rbx*1+0x0]
  400519:       41 8d 3c 1c             lea    edi,[r12+rbx*1]
  40051d:       e8 db ff ff ff          call   4004fd <_ZL3addRKiS0_.isra.0>
  400522:       01 c3                   add    ebx,eax
  400524:       ff cd                   dec    ebp
  400526:       75 ec                   jne    400514 <_ZL4workii+0x13>
  400528:       89 d8                   mov    eax,ebx
  40052a:       5b                      pop    rbx
  40052b:       5d                      pop    rbp
  40052c:       41 5c                   pop    r12
  40052e:       41 5d                   pop    r13
  400530:       c3                      ret
Answered Oct 25 '22 by Marat Dukhan