I've been playing around a little with x86-64 assembly, trying to learn more about the various SIMD extensions that are available (MMX, SSE, AVX).
To see how different C or C++ constructs are translated into machine code by GCC, I've been using Compiler Explorer, which is a superb tool.
During one of my 'play sessions' I wanted to see how GCC could optimize a simple run-time initialization of an integer array. In this case I tried to write the numbers 0 to 2047 to an array of 2048 unsigned integers.
The code looks as follows:
    unsigned int buffer[2048];

    void setup()
    {
      for (unsigned int i = 0; i < 2048; ++i)
      {
        buffer[i] = i;
      }
    }
If I enable optimizations and AVX-512 instructions (-O3 -mavx512f -mtune=intel), GCC 6.3 generates some really clever code :)
    setup():
            mov     eax, OFFSET FLAT:buffer
            mov     edx, OFFSET FLAT:buffer+8192
            vmovdqa64       zmm0, ZMMWORD PTR .LC0[rip]
            vmovdqa64       zmm1, ZMMWORD PTR .LC1[rip]
    .L2:
            vmovdqa64       ZMMWORD PTR [rax], zmm0
            add     rax, 64
            cmp     rdx, rax
            vpaddd  zmm0, zmm0, zmm1
            jne     .L2
            ret
    buffer:
            .zero   8192
    .LC0:
            .long   0
            .long   1
            .long   2
            .long   3
            .long   4
            .long   5
            .long   6
            .long   7
            .long   8
            .long   9
            .long   10
            .long   11
            .long   12
            .long   13
            .long   14
            .long   15
    .LC1:
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
            .long   16
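For readers less used to reading vector assembly, here is a rough scalar model in plain C of what that loop does. It is only a sketch of my reading of the listing, not compiler output: `model_buffer`, `setup_model`, `LANES` and `COUNT` are made-up names, the array `vec` stands in for zmm0 and `step` for zmm1.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 16     /* 32-bit lanes in one 512-bit zmm register */
#define COUNT 2048   /* number of elements, as in the question */

unsigned int model_buffer[COUNT];   /* hypothetical stand-in for `buffer` */

/* Scalar model of the vectorized loop in the listing above. */
void setup_model(void)
{
    unsigned int vec[LANES];    /* plays the role of zmm0 */
    unsigned int step[LANES];   /* plays the role of zmm1 */

    assert(COUNT % LANES == 0); /* whole vectors only, no scalar remainder */

    for (size_t lane = 0; lane < LANES; ++lane) {
        vec[lane] = (unsigned int)lane;   /* .LC0: 0, 1, ..., 15 */
        step[lane] = LANES;               /* .LC1: 16 in every lane */
    }

    for (size_t base = 0; base < COUNT; base += LANES) {
        for (size_t lane = 0; lane < LANES; ++lane)
            model_buffer[base + lane] = vec[lane];   /* vmovdqa64 [rax], zmm0 */
        for (size_t lane = 0; lane < LANES; ++lane)
            vec[lane] += step[lane];                 /* vpaddd zmm0, zmm0, zmm1 */
    }
}
```

Each pass of the outer loop corresponds to one 64-byte store followed by one vector add, so the whole array is filled in 2048 / 16 = 128 iterations.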
However, when I tested what would be generated if the same code was compiled using the GCC C-compiler, by adding the flags -x c, I was really surprised.
I expected similar, if not identical, results but the C-compiler seems to generate much more complicated and presumably also much slower machine code. The resulting assembly is too large to paste here in full, but it can be viewed at godbolt.org by following this link.
A snippet of the generated code, lines 58 to 83, can be seen below:
    .L2:
            vpbroadcastd    zmm0, r8d
            lea     rsi, buffer[0+rcx*4]
            vmovdqa64       zmm1, ZMMWORD PTR .LC1[rip]
            vpaddd  zmm0, zmm0, ZMMWORD PTR .LC0[rip]
            xor     ecx, ecx
    .L4:
            add     ecx, 1
            add     rsi, 64
            vmovdqa64       ZMMWORD PTR [rsi-64], zmm0
            cmp     ecx, edi
            vpaddd  zmm0, zmm0, zmm1
            jb      .L4
            sub     edx, r10d
            cmp     r9d, r10d
            lea     eax, [r8+r10]
            je      .L1
            mov     ecx, eax
            cmp     edx, 1
            mov     DWORD PTR buffer[0+rcx*4], eax
            lea     ecx, [rax+1]
            je      .L1
            mov     esi, ecx
            cmp     edx, 2
            mov     DWORD PTR buffer[0+rsi*4], ecx
            lea     ecx, [rax+2]
As you can see, this code has a lot of complicated moves and jumps and in general feels like a very complex way of performing a simple array initialization.
Why is there such a big difference in the generated code?
Is the GCC C++-compiler better in general at optimizing code that is valid in both C and C++ when compared to the C-compiler?
The extra code is for handling misalignment, because the instruction used, vmovdqa64, requires 64-byte alignment.
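To make that concrete, here is a minimal C sketch of the general shape of such misalignment handling: a scalar prologue up to the next 64-byte boundary, a main loop over whole 64-byte blocks, and a scalar epilogue for the leftovers. This is my own illustration, not a decompilation of GCC's output; `peel_count`, `setup_peeled` and `LANES` are hypothetical names, and the code assumes a 4-byte `unsigned int` as on x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16   /* unsigned ints covered by one 64-byte store */

/* Number of leading elements to write one at a time before `p`
   reaches a 64-byte boundary. */
size_t peel_count(const unsigned int *p)
{
    assert(((uintptr_t)p % sizeof(unsigned int)) == 0);  /* basic alignment */
    size_t misalign = (size_t)((uintptr_t)p & 63u);
    return misalign ? (64u - misalign) / sizeof(unsigned int) : 0;
}

void setup_peeled(unsigned int *buf, size_t n)
{
    size_t i = 0;
    size_t peel = peel_count(buf);
    if (peel > n)
        peel = n;

    for (; i < peel; ++i)                    /* scalar prologue */
        buf[i] = (unsigned int)i;

    for (; i + LANES <= n; i += LANES)       /* aligned main loop, one block per pass */
        for (size_t lane = 0; lane < LANES; ++lane)
            buf[i + lane] = (unsigned int)(i + lane);

    for (; i < n; ++i)                       /* scalar epilogue */
        buf[i] = (unsigned int)i;
}
```

When the compiler knows the array starts on a 64-byte boundary, the prologue and epilogue disappear and only the main loop is left, which is what the short C++ listing in the question looks like.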
In C, a file-scope declaration without an initializer, such as unsigned int buffer[2048];, is a tentative definition. My testing shows that, even though the standard doesn't, gcc in C mode does allow a definition in another module to override the one here. That definition might only comply with the basic alignment requirement (4 bytes), so the compiler can't rely on the bigger alignment. Technically, gcc emits a .comm assembly directive for this tentative definition, while an external definition uses a normal symbol in the .data section. During linking, that symbol takes precedence over the .comm one.
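The single-file analogue of this is standard C and easy to try. The sketch below (with a made-up array `table` and helper `table_at`) is valid when compiled as C, because the first two declarations are only tentative and the third one supplies the actual definition. Compiled as C++, the same file is rejected, since there every one of these lines is a full definition. Note that this only illustrates the tentative-definition rule inside one translation unit; the cross-module override described above is gcc behaviour, not something the standard promises.

```c
#include <assert.h>
#include <stddef.h>

unsigned int table[4];                  /* tentative definition */
unsigned int table[4];                  /* repeated tentative definition: legal in C */
unsigned int table[4] = {1, 2, 3, 4};   /* the actual definition, with an initializer */

/* Small accessor so the result of the definitions above can be observed. */
unsigned int table_at(size_t i)
{
    assert(i < 4);
    return table[i];
}
```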
Note that if you change the program to use extern unsigned int buffer[2048]; then even the C++ version will have the added code. Conversely, making it static unsigned int buffer[2048]; will turn the C version into the optimized one.