Trying famous branch-prediction example sometimes results in strange times

Question

I tried to duplicate the example in this famous question. My code looks like this:

#![feature(test)]
extern crate rand;
extern crate test;

use test::Bencher;
use rand::{thread_rng, Rng};

type ItemType = u8;
type SumType = u64;
const TEST_SIZE: usize = 32_768;

#[bench]
fn bench_train(b: &mut Bencher) {
    let numbers = get_random_vec();
    b.iter(|| calc_sum(&numbers));
}

#[bench]
fn bench_train_sort(b: &mut Bencher) {
    let mut numbers = get_random_vec();
    numbers.sort();     // <-- the magic difference
    b.iter(|| calc_sum(&numbers));
}

fn get_random_vec() -> Vec<ItemType> {
    thread_rng().gen_iter().take(TEST_SIZE).collect()
}

fn calc_sum(numbers: &Vec<ItemType>) -> SumType {
    let mut sum = 0;
    for &num in numbers {
        if num < ItemType::max_value() / 2 {
            sum += num.into();
        }
    }

    sum
}

If I benchmark the exact code from above I get reasonable results (like in the linked question):

test bench_train      ... bench:     148,611 ns/iter (+/- 8,445)
test bench_train_sort ... bench:      21,064 ns/iter (+/- 1,980)

However, if I change SumType to u8 both versions run equally fast and much faster overall:

test bench_train      ... bench:       1,272 ns/iter (+/- 64)
test bench_train_sort ... bench:       1,280 ns/iter (+/- 170)

First of: of course, the sum will overflow all the time, but in release mode the overflow checks of Rust are disabled, so we just calculate a wrong result without panicking. Could this be the reason for the surprisingly short time?

Even stranger: when I change the implementation of calc_sum to something more idiomatic, the results change again. My second implementation:

fn calc_sum(numbers: &Vec<ItemType>) -> SumType {
    numbers.iter()
        .filter(|&&num| num < ItemType::max_value() / 2)
        .fold(0, |acc, &num| acc + (num as SumType))
}

With this implementation the SumType doesn't matter anymore. With u8 as well as with u64 I get these results:

test bench_train      ... bench:     144,411 ns/iter (+/- 12,533)
test bench_train_sort ... bench:      16,966 ns/iter (+/- 1,100)

So we again get the numbers we are expecting. So the question is:

What is the reason for the strange running times?

PS: I tested with cargo bench which compiles in release mode.

PPS: I just noticed that in the first implementation of calc_sum I use into() for casting, whereas I use as in the second example. When also using as in the first example, I get more strange numbers. With SumType = u64:

test bench_train      ... bench:      39,850 ns/iter (+/- 2,355)
test bench_train_sort ... bench:      39,344 ns/iter (+/- 2,581)

With SumType = u8:

test bench_train      ... bench:       1,184 ns/iter (+/- 339)
test bench_train_sort ... bench:       1,239 ns/iter (+/- 85)

svat · Accepted Answer

I took a quick look at the assembler code, and it appears that if you use SumType = u8 then LLVM generates SSE2 instructions to do vector operations, which is much faster. In theory, LLVM should be able to optimize filter(...).fold(...) to the same code, but in practice it cannot always remove overhead of abstraction. I hope when MIR is added, Rust won't rely on LLVM to do Rust-specific optimizations.

Trying famous branch-prediction example sometimes results in strange times

Tags:

performance

branch-prediction

optimization

rust

Lukas Kalbertodt

1 Answers

svat

Recent Activity

Donate For Us

Trying famous branch-prediction example sometimes results in strange times

Tags:

performance

branch-prediction

optimization

rust

Lukas Kalbertodt

1 Answers

svat

Related questions

Recent Activity

Donate For Us