So, I had this code:

    constexpr unsigned N = 1000;
    void f1(char* sum, char* a, char* b) {
        for(int i = 0; i < N; ++i) {
            sum[i] = a[i] + b[i];
        }
    }

    void f2(char* sum, char* a, char* b) {
        char* end = sum + N;
        while(sum != end) {
            *sum++ = *a++ + *b++;
        }
    }

I wanted to see the code that GCC 4.7.2 would generate, so I ran:

    g++ -march=native -O3 -masm=intel -S a.c++ -std=c++11

And got the following output:

        .file "a.c++"
        .intel_syntax noprefix
        .text
        .p2align 4,,15
        .globl _Z2f1PcS_S_
        .type _Z2f1PcS_S_, @function
    _Z2f1PcS_S_:
    .LFB0:
        .cfi_startproc
        lea rcx, [rdx+16]
        lea rax, [rdi+16]
        cmp rdi, rcx
        setae r8b
        cmp rdx, rax
        setae cl
        or cl, r8b
        je .L5
        lea rcx, [rsi+16]
        cmp rdi, rcx
        setae cl
        cmp rsi, rax
        setae al
        or cl, al
        je .L5
        xor eax, eax
        .p2align 4,,10
        .p2align 3
    .L3:
        movdqu xmm0, XMMWORD PTR [rdx+rax]
        movdqu xmm1, XMMWORD PTR [rsi+rax]
        paddb xmm0, xmm1
        movdqu XMMWORD PTR [rdi+rax], xmm0
        add rax, 16
        cmp rax, 992
        jne .L3
        mov ax, 8
        mov r9d, 992
    .L2:
        sub eax, 1
        lea rcx, [rdx+r9]
        add rdi, r9
        lea r8, [rax+1]
        add rsi, r9
        xor eax, eax
        .p2align 4,,10
        .p2align 3
    .L4:
        movzx edx, BYTE PTR [rcx+rax]
        add dl, BYTE PTR [rsi+rax]
        mov BYTE PTR [rdi+rax], dl
        add rax, 1
        cmp rax, r8
        jne .L4
        rep ret
    .L5:
        mov eax, 1000
        xor r9d, r9d
        jmp .L2
        .cfi_endproc
    .LFE0:
        .size _Z2f1PcS_S_, .-_Z2f1PcS_S_
        .p2align 4,,15
        .globl _Z2f2PcS_S_
        .type _Z2f2PcS_S_, @function
    _Z2f2PcS_S_:
    .LFB1:
        .cfi_startproc
        lea rcx, [rdx+16]
        lea rax, [rdi+16]
        cmp rdi, rcx
        setae r8b
        cmp rdx, rax
        setae cl
        or cl, r8b
        je .L19
        lea rcx, [rsi+16]
        cmp rdi, rcx
        setae cl
        cmp rsi, rax
        setae al
        or cl, al
        je .L19
        xor eax, eax
        .p2align 4,,10
        .p2align 3
    .L17:
        movdqu xmm0, XMMWORD PTR [rdx+rax]
        movdqu xmm1, XMMWORD PTR [rsi+rax]
        paddb xmm0, xmm1
        movdqu XMMWORD PTR [rdi+rax], xmm0
        add rax, 16
        cmp rax, 992
        jne .L17
        add rdi, 992
        add rsi, 992
        add rdx, 992
        mov r8d, 8
    .L16:
        xor eax, eax
        .p2align 4,,10
        .p2align 3
    .L18:
        movzx ecx, BYTE PTR [rdx+rax]
        add cl, BYTE PTR [rsi+rax]
        mov BYTE PTR [rdi+rax], cl
        add rax, 1
        cmp rax, r8
        jne .L18
        rep ret
    .L19:
        mov r8d, 1000
        jmp .L16
        .cfi_endproc
    .LFE1:
        .size _Z2f2PcS_S_, .-_Z2f2PcS_S_
        .ident "GCC: (GNU) 4.7.2"
        .section .note.GNU-stack,"",@progbits

I suck at reading assembly, so I decided to add some markers to know where the bodies of the loops went:

    constexpr unsigned N = 1000;
    void f1(char* sum, char* a, char* b) {
        for(int i = 0; i < N; ++i) {
            asm("# im in ur loop");
            sum[i] = a[i] + b[i];
        }
    }

    void f2(char* sum, char* a, char* b) {
        char* end = sum + N;
        while(sum != end) {
            asm("# im in ur loop");
            *sum++ = *a++ + *b++;
        }
    }

And GCC spat this out:

        .file "a.c++"
        .intel_syntax noprefix
        .text
        .p2align 4,,15
        .globl _Z2f1PcS_S_
        .type _Z2f1PcS_S_, @function
    _Z2f1PcS_S_:
    .LFB0:
        .cfi_startproc
        xor eax, eax
        .p2align 4,,10
        .p2align 3
    .L2:
    #APP
    # 4 "a.c++" 1
        # im in ur loop
    # 0 "" 2
    #NO_APP
        movzx ecx, BYTE PTR [rdx+rax]
        add cl, BYTE PTR [rsi+rax]
        mov BYTE PTR [rdi+rax], cl
        add rax, 1
        cmp rax, 1000
        jne .L2
        rep ret
        .cfi_endproc
    .LFE0:
        .size _Z2f1PcS_S_, .-_Z2f1PcS_S_
        .p2align 4,,15
        .globl _Z2f2PcS_S_
        .type _Z2f2PcS_S_, @function
    _Z2f2PcS_S_:
    .LFB1:
        .cfi_startproc
        xor eax, eax
        .p2align 4,,10
        .p2align 3
    .L6:
    #APP
    # 12 "a.c++" 1
        # im in ur loop
    # 0 "" 2
    #NO_APP
        movzx ecx, BYTE PTR [rdx+rax]
        add cl, BYTE PTR [rsi+rax]
        mov BYTE PTR [rdi+rax], cl
        add rax, 1
        cmp rax, 1000
        jne .L6
        rep ret
        .cfi_endproc
    .LFE1:
        .size _Z2f2PcS_S_, .-_Z2f2PcS_S_
        .ident "GCC: (GNU) 4.7.2"
        .section .note.GNU-stack,"",@progbits

This is considerably shorter, and has some significant differences like the lack of SIMD instructions. I was expecting the same output, with some comments somewhere in the middle of it. Am I making some wrong assumption here? Is GCC's optimizer hindered by asm comments?
The interactions with optimisations are explained about halfway down the "Assembler Instructions with C Expression Operands" page in the documentation.

GCC doesn't try to understand any of the actual assembly inside the asm; the only thing it knows about the content is what you (optionally) tell it in the output and input operand specification and the register clobber list.

In particular, note:

"An asm instruction without any output operands will be treated identically to a volatile asm instruction."

and

"The volatile keyword indicates that the instruction has important side-effects [...]"

So the presence of the asm inside your loop has inhibited a vectorisation optimisation, because GCC assumes it has side effects.
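
To make the operand lists mentioned above concrete, here is a minimal sketch of GCC's extended asm syntax. The function names pass_through and sum_with_marker are made up for illustration, and the assembly templates are deliberately empty so the snippet does not depend on any particular instruction set. It only illustrates which asm statements count as volatile; it is not a recipe for getting the loop vectorised again.

```cpp
// Hypothetical helper: the template string is empty, so no instruction is
// emitted. GCC still honours the operand lists:
//   outputs  : "=r"(out)  -> out is written, in some register
//   inputs   : "0"(in)    -> in is placed in the same register as operand 0
//   clobbers : "memory"   -> GCC must assume memory may be read or written
// Because this asm has an output operand and no volatile qualifier, it is
// not implicitly volatile.
inline int pass_through(int in) {
    int out;
    asm("" : "=r"(out) : "0"(in) : "memory");
    return out;
}

// Hypothetical loop mirroring the question: this asm has no output
// operands, so GCC treats it as a volatile asm with side effects.
inline int sum_with_marker(const int* data, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) {
        asm("");  // no outputs: implicitly volatile
        total += data[i];
    }
    return total;
}
```

Called as pass_through(42), the first function returns its argument unchanged, since the input and output share a register and the empty template does nothing to it. The second function still computes the ordinary sum; the asm statement changes what the optimiser is allowed to assume, not the result.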