Why is the produced assembly not equivalent between returning by reference and copy when inlined?

Tags:

rust

I have a small struct:

pub struct Foo {
    pub a: i32,
    pub b: i32,
    pub c: i32,
}

I was using pairs of the fields in the form (a, b), (b, c), (c, a). To avoid duplicating code, I created a utility function that lets me iterate over the pairs:

impl Foo {
    // Returns the three field pairs by reference, so no value is copied.
    fn get_foo_ref(&self) -> [(&i32, &i32); 3] {
        [(&self.a, &self.b), (&self.b, &self.c), (&self.c, &self.a)]
    }
}

I had to decide whether to return the values as references or to copy the i32. Later on, I plan to switch to a non-Copy type instead of an i32, so I decided to use references. I expected the resulting code to be equivalent, since everything would be inlined.

I am generally optimistic about optimizations, so I expected the code using this function to compile to the same assembly as the hand-written version.

First the variant using the function:

pub fn testing_ref(f: Foo) -> i32 {
    let mut sum = 0;

    for i in 0..3 {
        let (l, r) = f.get_foo_ref()[i];

        sum += *l + *r;
    }

    sum
}

Then the hand-written variant:

pub fn testing_direct(f: Foo) -> i32 {
    let mut sum = 0;

    sum += f.a + f.b;
    sum += f.b + f.c;
    sum += f.c + f.a;

    sum
}
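
The by-value variant that the "3 methods" refer to is not shown above. A minimal sketch of what it might look like, assuming a `get_foo` method that copies the fields and a `testing` function that uses it (both bodies are reconstructions for illustration, not the original code):

```rust
pub struct Foo {
    pub a: i32,
    pub b: i32,
    pub c: i32,
}

impl Foo {
    // Assumed by-value counterpart of get_foo_ref: copies each i32.
    fn get_foo(&self) -> [(i32, i32); 3] {
        [(self.a, self.b), (self.b, self.c), (self.c, self.a)]
    }
}

// Assumed counterpart of testing_ref, using the copying helper.
pub fn testing(f: Foo) -> i32 {
    let mut sum = 0;

    for i in 0..3 {
        let (l, r) = f.get_foo()[i];

        sum += l + r;
    }

    sum
}

// Hand-written variant, as in the question.
pub fn testing_direct(f: Foo) -> i32 {
    let mut sum = 0;

    sum += f.a + f.b;
    sum += f.b + f.c;
    sum += f.c + f.a;

    sum
}

fn main() {
    // Both variants compute the same value; only the generated code may differ.
    let via_helper = testing(Foo { a: 1, b: 2, c: 3 });
    let direct = testing_direct(Foo { a: 1, b: 2, c: 3 });
    println!("{} {}", via_helper, direct);
}
```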

To my disappointment, all 3 methods resulted in different assembly code. The worst code was generated for the case with references, and the best code was the one that didn't use my utility function at all. Why is that? Shouldn't the compiler generate equivalent code in this case?

You can view the resulting assembly code on Godbolt; I also have the 'equivalent' assembly code from C++.

In C++, the compiler generated equivalent code between get_foo and get_foo_ref, although I don't understand why the code for all 3 cases is not equivalent.

Why did the compiler not generate equivalent code for all 3 cases?

Update:

I've slightly modified the code to use arrays and to add one more direct case.
Rust version with f64 and arrays
C++ version with f64 and arrays
This time the generated code in C++ is exactly the same for all cases. However, the Rust assembly differs, and returning by reference results in worse assembly.

Well, I guess this is another example that nothing can be taken for granted.

asked Feb 06 '17 19:02

Aleksander Fular

1 Answer

TL;DR: Microbenchmarks are tricky; instruction count does not directly translate into high or low performance.

"Later on, I plan to switch to a non-Copy type instead of an i32, so I decided to use references."

Then, you should check the generated assembly for your new type.

In your optimized example, the compiler is being very crafty:

pub fn testing_direct(f: Foo) -> i32 {
    let mut sum = 0;

    sum += f.a + f.b;
    sum += f.b + f.c;
    sum += f.c + f.a;

    sum
}

Yields:

example::testing_direct:
        push    rbp
        mov     rbp, rsp
        mov     eax, dword ptr [rdi + 4]
        add     eax, dword ptr [rdi]
        add     eax, dword ptr [rdi + 8]
        add     eax, eax
        pop     rbp
        ret

Which is roughly sum += f.a; sum += f.b; sum += f.c; sum += sum;.

That is, the compiler realized that:

1. f.X was added twice
2. f.X * 2 was equivalent to adding it twice

While the former may be inhibited in the other cases by the use of indirection, the latter is VERY specific to i32 (and to integer addition being associative and commutative, which lets the compiler regroup the terms).
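
To make the integer case concrete, here is a small sketch comparing the pairwise sum with the "add once, then double" form the compiler produced. The function names `pairwise` and `doubled` are made up for illustration, and `wrapping_add` is used on the assumption that overflow wraps, as it does in a release build without overflow checks:

```rust
// The sum as written in testing_direct: (a + b) + (b + c) + (c + a).
pub fn pairwise(a: i32, b: i32, c: i32) -> i32 {
    a.wrapping_add(b)
        .wrapping_add(b.wrapping_add(c))
        .wrapping_add(c.wrapping_add(a))
}

// The regrouped form matching the assembly: (a + b + c) added to itself.
pub fn doubled(a: i32, b: i32, c: i32) -> i32 {
    let s = a.wrapping_add(b).wrapping_add(c);
    s.wrapping_add(s)
}

fn main() {
    // The two forms agree even when intermediate results wrap around.
    for &(a, b, c) in &[(1, 2, 3), (i32::MAX, 1, 5), (i32::MIN, -1, 7)] {
        assert_eq!(pairwise(a, b, c), doubled(a, b, c));
    }
    println!("pairwise and doubled agree");
}
```

Because wrapping integer addition can be regrouped freely, the two functions return the same value for every input, which is what allows the optimizer to pick the shorter form.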

For example, switching your code to f32 (still Copy, but addition is no longer associative), I get the very same assembly for both testing_direct and testing (and slightly different assembly for testing_ref):

example::testing:
        push    rbp
        mov     rbp, rsp
        movss   xmm1, dword ptr [rdi]
        movss   xmm2, dword ptr [rdi + 4]
        movss   xmm0, dword ptr [rdi + 8]
        movaps  xmm3, xmm1
        addss   xmm3, xmm2
        xorps   xmm4, xmm4
        addss   xmm4, xmm3
        addss   xmm2, xmm0
        addss   xmm2, xmm4
        addss   xmm0, xmm1
        addss   xmm0, xmm2
        pop     rbp
        ret

And there's no trickery any longer.
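
The reason the trick disappears can be sketched in a few lines: regrouping floating-point additions can change the result, so the compiler has to keep the order as written. The helper names `left_assoc` and `right_assoc` and the sample values are chosen only for illustration:

```rust
// Adds in the order (a + b) + c.
pub fn left_assoc(a: f32, b: f32, c: f32) -> f32 {
    (a + b) + c
}

// Adds in the order a + (b + c).
pub fn right_assoc(a: f32, b: f32, c: f32) -> f32 {
    a + (b + c)
}

fn main() {
    let (a, b, c) = (1.0e20_f32, -1.0e20_f32, 1.0_f32);

    // a and b cancel first, so c survives.
    println!("(a + b) + c = {}", left_assoc(a, b, c));
    // c is far below the precision of b, so b + c rounds back to b.
    println!("a + (b + c) = {}", right_assoc(a, b, c));

    // Swapping the operands of a single addition is still harmless.
    assert_eq!(a + b, b + a);
}
```

With these inputs the two groupings give different answers, so the optimizer is not allowed to rewrite one into the other without an explicit opt-in to relaxed floating-point rules.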

So it's really not possible to infer much from your example; check with the real type.
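
One way to make that check easier is to keep the code generic, so the same function can be inspected for i32, f32, and the eventual type. This is only a sketch under assumptions: the generic `Foo<T>`, `pairs`, `sum_pairs`, and the non-Copy stand-in `Meters` are all hypothetical names, not part of the original code.

```rust
use std::ops::Add;

pub struct Foo<T> {
    pub a: T,
    pub b: T,
    pub c: T,
}

impl<T> Foo<T> {
    // Same pairs as get_foo_ref, but for any field type.
    pub fn pairs(&self) -> [(&T, &T); 3] {
        [(&self.a, &self.b), (&self.b, &self.c), (&self.c, &self.a)]
    }
}

// Sums all pairs; requires T + T and &T + &T to both produce a T.
pub fn sum_pairs<T>(f: &Foo<T>, zero: T) -> T
where
    T: Add<T, Output = T>,
    for<'a> &'a T: Add<&'a T, Output = T>,
{
    let mut sum = zero;
    for (l, r) in f.pairs() {
        sum = sum + (l + r);
    }
    sum
}

// Hypothetical non-Copy type standing in for the "real" one.
#[derive(Debug, PartialEq)]
pub struct Meters(pub Box<i64>);

impl Add for Meters {
    type Output = Meters;
    fn add(self, rhs: Meters) -> Meters {
        Meters(Box::new(*self.0 + *rhs.0))
    }
}

impl<'a> Add<&'a Meters> for &'a Meters {
    type Output = Meters;
    fn add(self, rhs: &'a Meters) -> Meters {
        Meters(Box::new(*self.0 + *rhs.0))
    }
}

fn main() {
    let ints = Foo { a: 1_i32, b: 2, c: 3 };
    println!("{}", sum_pairs(&ints, 0_i32));

    let boxed = Foo {
        a: Meters(Box::new(1)),
        b: Meters(Box::new(2)),
        c: Meters(Box::new(3)),
    };
    println!("{:?}", sum_pairs(&boxed, Meters(Box::new(0))));
}
```

Each instantiation is compiled separately, so the assembly for the i32 version says little about the assembly for the non-Copy version; both need to be looked at individually.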

answered Oct 25 '22 00:10

Matthieu M.
