
Why does adding inline assembly comments cause such radical change in GCC's generated code?

So, I had this code:

    constexpr unsigned N = 1000;

    void f1(char* sum, char* a, char* b) {
        for(int i = 0; i < N; ++i) {
            sum[i] = a[i] + b[i];
        }
    }

    void f2(char* sum, char* a, char* b) {
        char* end = sum + N;
        while(sum != end) {
            *sum++ = *a++ + *b++;
        }
    }

I wanted to see the code that GCC 4.7.2 would generate, so I ran

    g++ -march=native -O3 -masm=intel -S a.c++ -std=c++11

and got the following output:

            .file   "a.c++"
            .intel_syntax noprefix
            .text
            .p2align 4,,15
            .globl  _Z2f1PcS_S_
            .type   _Z2f1PcS_S_, @function
    _Z2f1PcS_S_:
    .LFB0:
            .cfi_startproc
            lea     rcx, [rdx+16]
            lea     rax, [rdi+16]
            cmp     rdi, rcx
            setae   r8b
            cmp     rdx, rax
            setae   cl
            or      cl, r8b
            je      .L5
            lea     rcx, [rsi+16]
            cmp     rdi, rcx
            setae   cl
            cmp     rsi, rax
            setae   al
            or      cl, al
            je      .L5
            xor     eax, eax
            .p2align 4,,10
            .p2align 3
    .L3:
            movdqu  xmm0, XMMWORD PTR [rdx+rax]
            movdqu  xmm1, XMMWORD PTR [rsi+rax]
            paddb   xmm0, xmm1
            movdqu  XMMWORD PTR [rdi+rax], xmm0
            add     rax, 16
            cmp     rax, 992
            jne     .L3
            mov     ax, 8
            mov     r9d, 992
    .L2:
            sub     eax, 1
            lea     rcx, [rdx+r9]
            add     rdi, r9
            lea     r8, [rax+1]
            add     rsi, r9
            xor     eax, eax
            .p2align 4,,10
            .p2align 3
    .L4:
            movzx   edx, BYTE PTR [rcx+rax]
            add     dl, BYTE PTR [rsi+rax]
            mov     BYTE PTR [rdi+rax], dl
            add     rax, 1
            cmp     rax, r8
            jne     .L4
            rep
            ret
    .L5:
            mov     eax, 1000
            xor     r9d, r9d
            jmp     .L2
            .cfi_endproc
    .LFE0:
            .size   _Z2f1PcS_S_, .-_Z2f1PcS_S_
            .p2align 4,,15
            .globl  _Z2f2PcS_S_
            .type   _Z2f2PcS_S_, @function
    _Z2f2PcS_S_:
    .LFB1:
            .cfi_startproc
            lea     rcx, [rdx+16]
            lea     rax, [rdi+16]
            cmp     rdi, rcx
            setae   r8b
            cmp     rdx, rax
            setae   cl
            or      cl, r8b
            je      .L19
            lea     rcx, [rsi+16]
            cmp     rdi, rcx
            setae   cl
            cmp     rsi, rax
            setae   al
            or      cl, al
            je      .L19
            xor     eax, eax
            .p2align 4,,10
            .p2align 3
    .L17:
            movdqu  xmm0, XMMWORD PTR [rdx+rax]
            movdqu  xmm1, XMMWORD PTR [rsi+rax]
            paddb   xmm0, xmm1
            movdqu  XMMWORD PTR [rdi+rax], xmm0
            add     rax, 16
            cmp     rax, 992
            jne     .L17
            add     rdi, 992
            add     rsi, 992
            add     rdx, 992
            mov     r8d, 8
    .L16:
            xor     eax, eax
            .p2align 4,,10
            .p2align 3
    .L18:
            movzx   ecx, BYTE PTR [rdx+rax]
            add     cl, BYTE PTR [rsi+rax]
            mov     BYTE PTR [rdi+rax], cl
            add     rax, 1
            cmp     rax, r8
            jne     .L18
            rep
            ret
    .L19:
            mov     r8d, 1000
            jmp     .L16
            .cfi_endproc
    .LFE1:
            .size   _Z2f2PcS_S_, .-_Z2f2PcS_S_
            .ident  "GCC: (GNU) 4.7.2"
            .section        .note.GNU-stack,"",@progbits

I suck at reading assembly, so I decided to add some markers to know where the bodies of the loops went:

    constexpr unsigned N = 1000;

    void f1(char* sum, char* a, char* b) {
        for(int i = 0; i < N; ++i) {
            asm("# im in ur loop");
            sum[i] = a[i] + b[i];
        }
    }

    void f2(char* sum, char* a, char* b) {
        char* end = sum + N;
        while(sum != end) {
            asm("# im in ur loop");
            *sum++ = *a++ + *b++;
        }
    }

And GCC spat this out:

            .file   "a.c++"
            .intel_syntax noprefix
            .text
            .p2align 4,,15
            .globl  _Z2f1PcS_S_
            .type   _Z2f1PcS_S_, @function
    _Z2f1PcS_S_:
    .LFB0:
            .cfi_startproc
            xor     eax, eax
            .p2align 4,,10
            .p2align 3
    .L2:
    #APP
    # 4 "a.c++" 1
            # im in ur loop
    # 0 "" 2
    #NO_APP
            movzx   ecx, BYTE PTR [rdx+rax]
            add     cl, BYTE PTR [rsi+rax]
            mov     BYTE PTR [rdi+rax], cl
            add     rax, 1
            cmp     rax, 1000
            jne     .L2
            rep
            ret
            .cfi_endproc
    .LFE0:
            .size   _Z2f1PcS_S_, .-_Z2f1PcS_S_
            .p2align 4,,15
            .globl  _Z2f2PcS_S_
            .type   _Z2f2PcS_S_, @function
    _Z2f2PcS_S_:
    .LFB1:
            .cfi_startproc
            xor     eax, eax
            .p2align 4,,10
            .p2align 3
    .L6:
    #APP
    # 12 "a.c++" 1
            # im in ur loop
    # 0 "" 2
    #NO_APP
            movzx   ecx, BYTE PTR [rdx+rax]
            add     cl, BYTE PTR [rsi+rax]
            mov     BYTE PTR [rdi+rax], cl
            add     rax, 1
            cmp     rax, 1000
            jne     .L6
            rep
            ret
            .cfi_endproc
    .LFE1:
            .size   _Z2f2PcS_S_, .-_Z2f2PcS_S_
            .ident  "GCC: (GNU) 4.7.2"
            .section        .note.GNU-stack,"",@progbits

This is considerably shorter, and has some significant differences like the lack of SIMD instructions. I was expecting the same output, with some comments somewhere in the middle of it. Am I making some wrong assumption here? Is GCC's optimizer hindered by asm comments?

R. Martinho Fernandes asked Dec 19 '12 15:12



1 Answer

The interactions with optimisations are explained about halfway down the "Assembler Instructions with C Expression Operands" page in the documentation.

GCC doesn't try to understand any of the actual assembly inside the asm; the only thing it knows about the content is what you (optionally) tell it in the output and input operand specification and the register clobber list.
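As a rough illustration of that point, here is a minimal sketch in which an empty asm template is given a single read-write operand. The helper name `opaque` is made up for this example and does not come from the GCC documentation. Because GCC relies on the operand list rather than on the template text, it has to assume that `x` may have been modified, even though the template emits no instruction at all.

```cpp
// Hypothetical helper, for illustration only.
// The template string is empty, so no instruction is emitted, but the
// "+r" constraint declares x as an operand that the asm both reads and
// writes. GCC takes that declaration at face value and does not parse
// the template to check it.
static int opaque(int x) {
    asm("" : "+r"(x));
    return x;
}
```

Calling `opaque(20)` still returns 20 at run time, since nothing actually touches the register; the difference lies only in what the optimiser is allowed to assume about the value.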

In particular, note:

An asm instruction without any output operands will be treated identically to a volatile asm instruction.

and

The volatile keyword indicates that the instruction has important side-effects [...]

So the presence of the asm inside your loop has inhibited a vectorisation optimisation, because GCC assumes it has side effects.
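One possible workaround, sketched here on the assumption that the markers only need to bracket the loop rather than sit inside its body, is to place the asm comments immediately before and after the loop. The name `f1_marked` and the marker texts are invented for this example. Since the loop body then contains no asm statement, the vectoriser should be free to treat it as before, but that is an expectation to confirm by inspecting the -S output, not something the answer above verified.

```cpp
constexpr unsigned N = 1000;

// Hypothetical variant of f1 from the question. The markers bracket the
// loop instead of sitting in its body, so the loop itself contains no
// asm statement.
void f1_marked(char* sum, char* a, char* b) {
    asm("# loop begins");
    for (unsigned i = 0; i < N; ++i) {
        sum[i] = a[i] + b[i];
    }
    asm("# loop ends");
}
```

The function computes the same element-wise sums as `f1`; only the position of the markers differs.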

Matthew Slattery answered Sep 21 '22 17:09