Why does this code generate much more assembly than equivalent C++/Clang? [closed]

I wrote a simple C++ function in order to check compiler optimization:

bool f1(bool a, bool b) {
    return !a || (a && b);
}

After that I checked the equivalent in Rust:

fn f1(a: bool, b: bool) -> bool {
    !a || (a && b)
}

I used Godbolt to inspect the assembly output.

The result for the C++ code (compiled by clang with the -O3 flag) is the following:

f1(bool, bool):                                # @f1(bool, bool)
    xor     dil, 1
    or      dil, sil
    mov     eax, edi
    ret

The output for the Rust equivalent is much longer:

example::f1:
  push rbp
  mov rbp, rsp
  mov al, sil
  mov cl, dil
  mov dl, cl
  xor dl, -1
  test dl, 1
  mov byte ptr [rbp - 3], al
  mov byte ptr [rbp - 4], cl
  jne .LBB0_1
  jmp .LBB0_3
.LBB0_1:
  mov byte ptr [rbp - 2], 1
  jmp .LBB0_4
.LBB0_2:
  mov byte ptr [rbp - 2], 0
  jmp .LBB0_4
.LBB0_3:
  mov al, byte ptr [rbp - 4]
  test al, 1
  jne .LBB0_7
  jmp .LBB0_6
.LBB0_4:
  mov al, byte ptr [rbp - 2]
  and al, 1
  movzx eax, al
  pop rbp
  ret
.LBB0_5:
  mov byte ptr [rbp - 1], 1
  jmp .LBB0_8
.LBB0_6:
  mov byte ptr [rbp - 1], 0
  jmp .LBB0_8
.LBB0_7:
  mov al, byte ptr [rbp - 3]
  test al, 1
  jne .LBB0_5
  jmp .LBB0_6
.LBB0_8:
  test byte ptr [rbp - 1], 1
  jne .LBB0_1
  jmp .LBB0_2

I also tried the -O option, but then the output is empty (the unused function is deleted).

I am intentionally NOT using any library, to keep the output clean. Note that both clang and rustc use LLVM as a backend. What explains this huge difference in output? And if it is only a matter of optimizations being disabled, how can I see optimized output from rustc?

asked Aug 08 '17 07:08

Mariusz Jaskółka

1 Answer

Compiling with the compiler flag -O (and with pub added to the function, so that it is not removed as unused), I get this output on Godbolt:

push    rbp
mov     rbp, rsp
xor     dil, 1
or      dil, sil
mov     eax, edi
pop     rbp
ret
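
For reference, a minimal sketch of the "added pub" variant follows. It assumes the file is compiled as a library crate, where a private function that nothing calls is dropped as dead code, which would explain the empty -O output mentioned in the question.

```rust
// Same body as in the question; only the visibility changes.
// `pub` exports the function from the library crate, so the optimizer
// has to keep it even though nothing in this file calls it.
pub fn f1(a: bool, b: bool) -> bool {
    !a || (a && b)
}
```

The body is unchanged, so the function still returns true whenever a is false and otherwise returns b.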

A few things:

Why is it still longer than the C++ version?

The Rust version is exactly three instructions longer:

```
push    rbp
mov     rbp, rsp
[...]
pop     rbp
```
These are instructions to manage the so-called frame pointer or base pointer (rbp). This is mainly required to get nice stack traces. If you keep the frame pointer in the C++ version by passing -fno-omit-frame-pointer, you get the same result. Note that this uses g++ instead of clang++ since I haven't found a comparable option for the clang compiler.

Why doesn't Rust omit the frame pointer?

Actually, it does. But Godbolt adds an option to the compiler to preserve the frame pointer. You can read more about why this is done here. If you compile your code locally with rustc -O --crate-type=lib foo.rs --emit asm -C "llvm-args=-x86-asm-syntax=intel", you get this output:

```
f1:
    xor dil, 1
    or  dil, sil
    mov eax, edi
    ret
```
Which is exactly the output of your C++ version.
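
As a sanity check on why two instructions suffice, the sketch below compares the expression from the question with the simplified form that the xor/or pair appears to compute. The function names original, simplified and forms_agree are hypothetical and not part of the question; this is an illustration of the logical equivalence, not a description of how LLVM derives it.

```rust
/// The expression exactly as written in the question.
pub fn original(a: bool, b: bool) -> bool {
    !a || (a && b)
}

/// Mirrors the optimized assembly: flip `a` (xor with 1), then OR with `b`.
pub fn simplified(a: bool, b: bool) -> bool {
    (a ^ true) | b
}

/// Exhaustively compares both forms over all four input combinations.
pub fn forms_agree() -> bool {
    [false, true].iter().all(|&a| {
        [false, true]
            .iter()
            .all(|&b| original(a, b) == simplified(a, b))
    })
}
```

Since there are only four possible inputs, the exhaustive comparison in forms_agree is enough to show that the shorter form is a valid replacement.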

You can "undo" what Godbolt does by passing -C debuginfo=0 to the compiler.

Why -O instead of --release?

Godbolt uses rustc directly instead of cargo. The --release flag is a flag for cargo. To enable optimizations on rustc, you need to pass -O or -C opt-level=3 (or any other level between 0 and 3).

answered Sep 29 '22 21:09

Lukas Kalbertodt

Tags: optimization, rust, llvm-codegen
