I tried to duplicate the example in this famous question. My code looks like this:
#![feature(test)]
extern crate rand;
extern crate test;
use test::Bencher;
use rand::{thread_rng, Rng};
type ItemType = u8;
type SumType = u64;
const TEST_SIZE: usize = 32_768;
#[bench]
fn bench_train(b: &mut Bencher) {
let numbers = get_random_vec();
b.iter(|| calc_sum(&numbers));
}
#[bench]
fn bench_train_sort(b: &mut Bencher) {
let mut numbers = get_random_vec();
numbers.sort(); // <-- the magic difference
b.iter(|| calc_sum(&numbers));
}
fn get_random_vec() -> Vec<ItemType> {
thread_rng().gen_iter().take(TEST_SIZE).collect()
}
fn calc_sum(numbers: &Vec<ItemType>) -> SumType {
let mut sum = 0;
for &num in numbers {
if num < ItemType::max_value() / 2 {
sum += num.into();
}
}
sum
}
If I benchmark the exact code from above I get reasonable results (like in the linked question):
test bench_train ... bench: 148,611 ns/iter (+/- 8,445)
test bench_train_sort ... bench: 21,064 ns/iter (+/- 1,980)
However, if I change SumType
to u8
both versions run equally fast and much faster overall:
test bench_train ... bench: 1,272 ns/iter (+/- 64)
test bench_train_sort ... bench: 1,280 ns/iter (+/- 170)
First of: of course, the sum
will overflow all the time, but in release mode the overflow checks of Rust are disabled, so we just calculate a wrong result without panicking. Could this be the reason for the surprisingly short time?
Even stranger: when I change the implementation of calc_sum
to something more idiomatic, the results change again. My second implementation:
fn calc_sum(numbers: &Vec<ItemType>) -> SumType {
numbers.iter()
.filter(|&&num| num < ItemType::max_value() / 2)
.fold(0, |acc, &num| acc + (num as SumType))
}
With this implementation the SumType
doesn't matter anymore. With u8
as well as with u64
I get these results:
test bench_train ... bench: 144,411 ns/iter (+/- 12,533)
test bench_train_sort ... bench: 16,966 ns/iter (+/- 1,100)
So we again get the numbers we are expecting. So the question is:
What is the reason for the strange running times?
PS: I tested with cargo bench
which compiles in release mode.
PPS: I just noticed that in the first implementation of calc_sum
I use into()
for casting, whereas I use as
in the second example. When also using as
in the first example, I get more strange numbers. With SumType = u64
:
test bench_train ... bench: 39,850 ns/iter (+/- 2,355)
test bench_train_sort ... bench: 39,344 ns/iter (+/- 2,581)
With SumType = u8
:
test bench_train ... bench: 1,184 ns/iter (+/- 339)
test bench_train_sort ... bench: 1,239 ns/iter (+/- 85)
I took a quick look at the assembler code, and it appears that if you use SumType = u8
then LLVM generates SSE2 instructions to do vector operations, which is much faster. In theory, LLVM should be able to optimize filter(...).fold(...)
to the same code, but in practice it cannot always remove overhead of abstraction. I hope when MIR is added, Rust won't rely on LLVM to do Rust-specific optimizations.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With