Big differences in GCC code generation when compiling as C++ vs C

Tags: c++, c, gcc, assembly, x86-64

I've been playing around a little bit with x86-64 assembly, trying to learn more about the various SIMD extensions that are available (MMX, SSE, AVX).

To see how different C or C++ constructs are translated into machine code by GCC, I've been using Compiler Explorer, which is a superb tool.

During one of my 'play sessions' I wanted to see how GCC could optimize a simple run-time initialization of an integer array. In this case I tried to write the numbers 0 to 2047 to an array of 2048 unsigned integers.

The code looks as follows:

    unsigned int buffer[2048];

    void setup()
    {
      for (unsigned int i = 0; i < 2048; ++i)
      {
        buffer[i] = i;
      }
    }

If I enable optimizations and AVX-512 instructions (-O3 -mavx512f -mtune=intel), GCC 6.3 generates some really clever code :)

    setup():
            mov     eax, OFFSET FLAT:buffer
            mov     edx, OFFSET FLAT:buffer+8192
            vmovdqa64       zmm0, ZMMWORD PTR .LC0[rip]
            vmovdqa64       zmm1, ZMMWORD PTR .LC1[rip]
    .L2:
            vmovdqa64       ZMMWORD PTR [rax], zmm0
            add     rax, 64
            cmp     rdx, rax
            vpaddd  zmm0, zmm0, zmm1
            jne     .L2
            ret
    buffer:
            .zero   8192
    .LC0:
            .long   0
            .long   1
            .long   2
            .long   3
            .long   4
            .long   5
            .long   6
            .long   7
            .long   8
            .long   9
            .long   10
            .long   11
            .long   12
            .long   13
            .long   14
            .long   15
    .LC1:
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
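For readers less used to AVX-512 assembly, the loop in that listing can be read roughly as the plain C below. This is only a scalar sketch of what the vector instructions do, not compiler output: `values` stands in for `zmm0` (loaded from `.LC0`), `step` stands in for `zmm1` (loaded from `.LC1`), and the names `emulated`, `setup_emulated`, `LANES` and `BUFFER_LEN` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 16u          /* one 64-byte zmm register holds 16 unsigned ints */
#define BUFFER_LEN 2048u

unsigned int emulated[BUFFER_LEN];

/* Scalar emulation of the vectorized loop: store 16 values, then bump all 16. */
void setup_emulated(void)
{
    static_assert(BUFFER_LEN % LANES == 0, "buffer must be whole vectors");

    unsigned int values[LANES];   /* plays the role of zmm0: 0, 1, ..., 15 */
    unsigned int step[LANES];     /* plays the role of zmm1: 16 in every lane */

    for (unsigned int lane = 0; lane < LANES; ++lane) {
        values[lane] = lane;
        step[lane] = LANES;
    }

    /* 2048 / 16 = 128 iterations, each covering 64 bytes of the buffer. */
    for (size_t base = 0; base < BUFFER_LEN; base += LANES) {
        /* vmovdqa64 [rax], zmm0: one store of all 16 lanes */
        for (unsigned int lane = 0; lane < LANES; ++lane) {
            emulated[base + lane] = values[lane];
        }
        /* vpaddd zmm0, zmm0, zmm1: add 16 to every lane */
        for (unsigned int lane = 0; lane < LANES; ++lane) {
            values[lane] += step[lane];
        }
    }
}
```

Calling `setup_emulated()` leaves `emulated[i] == i` for every index, which is the same result as the original `setup()`.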

However, when I tested what would be generated if the same code was compiled as C, by adding the flag -x c, I was really surprised.

I expected similar, if not identical, results but the C-compiler seems to generate much more complicated and presumably also much slower machine code. The resulting assembly is too large to paste here in full, but it can be viewed at godbolt.org by following this link.

A snippet of the generated code, lines 58 to 83, can be seen below:

    .L2:
            vpbroadcastd    zmm0, r8d
            lea     rsi, buffer[0+rcx*4]
            vmovdqa64       zmm1, ZMMWORD PTR .LC1[rip]
            vpaddd  zmm0, zmm0, ZMMWORD PTR .LC0[rip]
            xor     ecx, ecx
    .L4:
            add     ecx, 1
            add     rsi, 64
            vmovdqa64       ZMMWORD PTR [rsi-64], zmm0
            cmp     ecx, edi
            vpaddd  zmm0, zmm0, zmm1
            jb      .L4
            sub     edx, r10d
            cmp     r9d, r10d
            lea     eax, [r8+r10]
            je      .L1
            mov     ecx, eax
            cmp     edx, 1
            mov     DWORD PTR buffer[0+rcx*4], eax
            lea     ecx, [rax+1]
            je      .L1
            mov     esi, ecx
            cmp     edx, 2
            mov     DWORD PTR buffer[0+rsi*4], ecx
            lea     ecx, [rax+2]

As you can see, this code has a lot of complicated moves and jumps and in general feels like a very complex way of performing a simple array initialization.

Why is there such a big difference in the generated code?

Is the GCC C++-compiler better in general at optimizing code that is valid in both C and C++ when compared to the C-compiler?

809

asked Dec 22 '16 23:12

JonatanE

1 Answer

The extra code handles possible misalignment: the instruction used, vmovdqa64, requires its memory operand to be 64-byte aligned.
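The shape of that extra code can be sketched in plain C. The following is a hand-written illustration of the general pattern (scalar stores until a 64-byte boundary, whole 64-byte blocks, then a scalar tail); it is not a transcription of GCC's output. The names `fill_peeled`, `holds_sequence`, `VEC_BYTES` and `LANES` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 64u                                /* size of one zmm store */
#define LANES (VEC_BYTES / sizeof(unsigned int))     /* elements per store */

/* Hypothetical helper: write 0..n-1 into dst, peeling for 64-byte alignment. */
void fill_peeled(unsigned int *dst, size_t n)
{
    /* Only the basic alignment of unsigned int is assumed. */
    assert((uintptr_t)dst % sizeof(unsigned int) == 0);

    size_t i = 0;

    /* Prologue: scalar stores until dst + i sits on a 64-byte boundary. */
    while (i < n && (uintptr_t)(dst + i) % VEC_BYTES != 0) {
        dst[i] = (unsigned int)i;
        ++i;
    }

    /* Main body: whole 64-byte blocks. Each block starts on a boundary,
       so a compiler could emit one aligned vector store per block. */
    for (; i + LANES <= n; i += LANES) {
        for (size_t lane = 0; lane < LANES; ++lane) {
            dst[i + lane] = (unsigned int)(i + lane);
        }
    }

    /* Epilogue: the elements left over after the last full block. */
    for (; i < n; ++i) {
        dst[i] = (unsigned int)i;
    }
}

/* Hypothetical checker: returns 1 if dst[0..n) holds 0, 1, ..., n-1. */
int holds_sequence(const unsigned int *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (dst[i] != (unsigned int)i) {
            return 0;
        }
    }
    return 1;
}
```

When the compiler knows the array starts on a 64-byte boundary, the prologue and epilogue vanish and only the middle loop remains, which matches the short listing in the question.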

My testing shows that, even though the standard doesn't allow it, gcc in C mode lets a definition in another module override the one here. That definition might only meet the basic alignment requirement (4 bytes), so the compiler can't rely on the larger alignment. Technically, gcc emits a .comm assembly directive for this tentative definition, while an external definition uses a normal symbol in the .data section. During linking, the normal symbol takes precedence over the .comm one.
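As a small illustration of what "tentative definition" means in C, the single-file sketch below repeats the declaration and only later supplies an initializer. This is the same-file rule that the C standard does define; it is valid C but a redefinition error in C++. It does not reproduce the cross-module override described above, which needs two object files and gcc's common-symbol behaviour (the default in the GCC 6.3 used in the question; GCC 10 and later default to -fno-common). The helper name `first_element` and the value 42 are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

unsigned int buffer[2048];             /* tentative definition */
unsigned int buffer[2048];             /* another tentative definition: still valid C */
unsigned int buffer[2048] = { 42u };   /* the actual definition, with an initializer */

/* Hypothetical helper: reads the first element of the one and only buffer. */
unsigned int first_element(void)
{
    /* All three declarations above name a single object of 2048 elements. */
    static_assert(sizeof buffer / sizeof buffer[0] == 2048, "one array, not three");
    return buffer[0];
}
```

Compiled as C, `first_element()` returns 42 and the remaining elements are zero; compiled as C++, the file is rejected, which is why the C++ compiler can treat the first declaration as final.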

Note that if you change the program to use extern unsigned int buffer[2048]; then even the C++ version will contain the extra code. Conversely, making it static unsigned int buffer[2048]; turns the C version into the optimized one.
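A minimal sketch of the static variant, assuming the array does not need to be visible to other modules. The accessor `buffer_at` and the macro `BUFFER_LEN` are hypothetical additions so the result can be inspected; they are not part of the original program.

```c
#include <assert.h>
#include <stddef.h>

#define BUFFER_LEN 2048u

/* Internal linkage: no other module can supply a different definition,
   so the compiler is free to choose and rely on the array's alignment. */
static unsigned int buffer[BUFFER_LEN];

void setup(void)
{
    for (unsigned int i = 0; i < BUFFER_LEN; ++i) {
        buffer[i] = i;
    }
}

/* Hypothetical accessor, since a static array is not reachable from other files. */
unsigned int buffer_at(size_t index)
{
    assert(index < BUFFER_LEN);
    return buffer[index];
}
```

After `setup()`, `buffer_at(i)` returns `i` for every valid index. If other modules must access the array directly, this approach does not apply as written.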

128

answered Sep 30 '22 02:09

Jester
