Why does adding inline assembly comments cause such radical change in GCC's generated code?

So, I had this code:

    constexpr unsigned N = 1000;
    void f1(char* sum, char* a, char* b) {
        for(int i = 0; i < N; ++i) {
            sum[i] = a[i] + b[i];
        }
    }

    void f2(char* sum, char* a, char* b) {
        char* end = sum + N;
        while(sum != end) {
            *sum++ = *a++ + *b++;
        }
    }

I wanted to see the code that GCC 4.7.2 would generate, so I ran

    g++ -march=native -O3 -masm=intel -S a.c++ -std=c++11

and got the following output:

            .file   "a.c++"
            .intel_syntax noprefix
            .text
            .p2align 4,,15
            .globl  _Z2f1PcS_S_
            .type   _Z2f1PcS_S_, @function
    _Z2f1PcS_S_:
    .LFB0:
            .cfi_startproc
            lea     rcx, [rdx+16]
            lea     rax, [rdi+16]
            cmp     rdi, rcx
            setae   r8b
            cmp     rdx, rax
            setae   cl
            or      cl, r8b
            je      .L5
            lea     rcx, [rsi+16]
            cmp     rdi, rcx
            setae   cl
            cmp     rsi, rax
            setae   al
            or      cl, al
            je      .L5
            xor     eax, eax
            .p2align 4,,10
            .p2align 3
    .L3:
            movdqu  xmm0, XMMWORD PTR [rdx+rax]
            movdqu  xmm1, XMMWORD PTR [rsi+rax]
            paddb   xmm0, xmm1
            movdqu  XMMWORD PTR [rdi+rax], xmm0
            add     rax, 16
            cmp     rax, 992
            jne     .L3
            mov     ax, 8
            mov     r9d, 992
    .L2:
            sub     eax, 1
            lea     rcx, [rdx+r9]
            add     rdi, r9
            lea     r8, [rax+1]
            add     rsi, r9
            xor     eax, eax
            .p2align 4,,10
            .p2align 3
    .L4:
            movzx   edx, BYTE PTR [rcx+rax]
            add     dl, BYTE PTR [rsi+rax]
            mov     BYTE PTR [rdi+rax], dl
            add     rax, 1
            cmp     rax, r8
            jne     .L4
            rep
            ret
    .L5:
            mov     eax, 1000
            xor     r9d, r9d
            jmp     .L2
            .cfi_endproc
    .LFE0:
            .size   _Z2f1PcS_S_, .-_Z2f1PcS_S_
            .p2align 4,,15
            .globl  _Z2f2PcS_S_
            .type   _Z2f2PcS_S_, @function
    _Z2f2PcS_S_:
    .LFB1:
            .cfi_startproc
            lea     rcx, [rdx+16]
            lea     rax, [rdi+16]
            cmp     rdi, rcx
            setae   r8b
            cmp     rdx, rax
            setae   cl
            or      cl, r8b
            je      .L19
            lea     rcx, [rsi+16]
            cmp     rdi, rcx
            setae   cl
            cmp     rsi, rax
            setae   al
            or      cl, al
            je      .L19
            xor     eax, eax
            .p2align 4,,10
            .p2align 3
    .L17:
            movdqu  xmm0, XMMWORD PTR [rdx+rax]
            movdqu  xmm1, XMMWORD PTR [rsi+rax]
            paddb   xmm0, xmm1
            movdqu  XMMWORD PTR [rdi+rax], xmm0
            add     rax, 16
            cmp     rax, 992
            jne     .L17
            add     rdi, 992
            add     rsi, 992
            add     rdx, 992
            mov     r8d, 8
    .L16:
            xor     eax, eax
            .p2align 4,,10
            .p2align 3
    .L18:
            movzx   ecx, BYTE PTR [rdx+rax]
            add     cl, BYTE PTR [rsi+rax]
            mov     BYTE PTR [rdi+rax], cl
            add     rax, 1
            cmp     rax, r8
            jne     .L18
            rep
            ret
    .L19:
            mov     r8d, 1000
            jmp     .L16
            .cfi_endproc
    .LFE1:
            .size   _Z2f2PcS_S_, .-_Z2f2PcS_S_
            .ident  "GCC: (GNU) 4.7.2"
            .section        .note.GNU-stack,"",@progbits

I suck at reading assembly, so I decided to add some markers to know where the bodies of the loops went:

    constexpr unsigned N = 1000;
    void f1(char* sum, char* a, char* b) {
        for(int i = 0; i < N; ++i) {
            asm("# im in ur loop");
            sum[i] = a[i] + b[i];
        }
    }

    void f2(char* sum, char* a, char* b) {
        char* end = sum + N;
        while(sum != end) {
            asm("# im in ur loop");
            *sum++ = *a++ + *b++;
        }
    }

And GCC spat this out:

        .file   "a.c++"
        .intel_syntax noprefix
        .text
        .p2align 4,,15
        .globl  _Z2f1PcS_S_
        .type   _Z2f1PcS_S_, @function
    _Z2f1PcS_S_:
    .LFB0:
        .cfi_startproc
        xor eax, eax
        .p2align 4,,10
        .p2align 3
    .L2:
    #APP
    # 4 "a.c++" 1
        # im in ur loop
    # 0 "" 2
    #NO_APP
        movzx   ecx, BYTE PTR [rdx+rax]
        add cl, BYTE PTR [rsi+rax]
        mov BYTE PTR [rdi+rax], cl
        add rax, 1
        cmp rax, 1000
        jne .L2
        rep
        ret
        .cfi_endproc
    .LFE0:
        .size   _Z2f1PcS_S_, .-_Z2f1PcS_S_
        .p2align 4,,15
        .globl  _Z2f2PcS_S_
        .type   _Z2f2PcS_S_, @function
    _Z2f2PcS_S_:
    .LFB1:
        .cfi_startproc
        xor eax, eax
        .p2align 4,,10
        .p2align 3
    .L6:
    #APP
    # 12 "a.c++" 1
        # im in ur loop
    # 0 "" 2
    #NO_APP
        movzx   ecx, BYTE PTR [rdx+rax]
        add cl, BYTE PTR [rsi+rax]
        mov BYTE PTR [rdi+rax], cl
        add rax, 1
        cmp rax, 1000
        jne .L6
        rep
        ret
        .cfi_endproc
    .LFE1:
        .size   _Z2f2PcS_S_, .-_Z2f2PcS_S_
        .ident  "GCC: (GNU) 4.7.2"
        .section    .note.GNU-stack,"",@progbits

This is considerably shorter, and has some significant differences like the lack of SIMD instructions. I was expecting the same output, with some comments somewhere in the middle of it. Am I making some wrong assumption here? Is GCC's optimizer hindered by asm comments?

asked Dec 19 '12 15:12

R. Martinho Fernandes

1 Answer

The interactions with optimisations are explained about halfway down the "Assembler Instructions with C Expression Operands" page in the documentation.

GCC doesn't try to understand any of the actual assembly inside the asm; the only thing it knows about the content is what you (optionally) tell it in the output and input operand specification and the register clobber list.
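To make that concrete, here is a minimal sketch, assuming GCC or Clang (the extended asm syntax is a GNU extension). The helper name opaque_copy is made up for illustration. The asm template is empty, so no instruction is emitted; the only thing the compiler learns comes from the operand list, which says that value is read and written.

```cpp
#include <cassert>

// Hypothetical helper, not part of the question's code.
// The template string is empty, so nothing is emitted for it.
// The "+r" constraint declares `value` as a read-write register operand,
// so the compiler has to assume the asm may have changed it.
inline int opaque_copy(int value) {
    asm("" : "+r"(value));
    return value;
}

void demo() {
    // At run time the value is unchanged, because no instruction ran.
    assert(opaque_copy(42) == 42);
}
```

The compiler does not look inside the template to notice that it is empty; it relies on the operand specification alone.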

In particular, note:

An asm instruction without any output operands will be treated identically to a volatile asm instruction.

and

The volatile keyword indicates that the instruction has important side-effects [...]

So the presence of the asm inside your loop has inhibited a vectorisation optimisation, because GCC assumes it has side effects.
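A small sketch of the situation, assuming GCC or Clang targeting x86-64, where # starts a comment in the assembler. The function names add_plain, add_marked and same_results are made up for illustration. Both loops compute the same result; what differs is how much freedom the optimiser has, because the asm in the second loop has no output operands and is therefore treated as volatile.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 1000;

// Plain loop: nothing stops the optimiser from vectorising it.
void add_plain(char* sum, const char* a, const char* b) {
    for (std::size_t i = 0; i < N; ++i) {
        sum[i] = a[i] + b[i];
    }
}

// Same loop with an asm marker. The asm has no output operands, so it is
// treated like a volatile asm and assumed to have side effects.
void add_marked(char* sum, const char* a, const char* b) {
    for (std::size_t i = 0; i < N; ++i) {
        asm("# loop body marker");
        sum[i] = a[i] + b[i];
    }
}

// Hypothetical check: the marker changes code generation, not the result.
bool same_results() {
    char a[N], b[N], plain[N], marked[N];
    for (std::size_t i = 0; i < N; ++i) {
        a[i] = static_cast<char>(i % 50);  // sums stay below 128
        b[i] = static_cast<char>(i % 30);
    }
    add_plain(plain, a, b);
    add_marked(marked, a, b);
    for (std::size_t i = 0; i < N; ++i) {
        if (plain[i] != marked[i]) return false;
    }
    return true;
}

void demo() {
    assert(same_results());
}
```

Compiling this with the flags from the question and comparing the two functions in the -S output is one way to see whether your compiler version vectorises the marked loop.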

answered Sep 21 '22 17:09

Matthew Slattery

Tags: c++, optimization, gcc, assembly, inline-assembly
