Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Zero-cost abstractions: performance of for-loop vs. iterators

Reading the Zero-cost abstractions and looking at Introduction to rust: a low-level language with high-level abstractions I tried to compare two approaches to computing the dot product of a vector: one using a for loop and one using iterators.

#![feature(test)]

extern crate rand;
extern crate test;

use std::cmp::min;

fn dot_product_1(x: &[f64], y: &[f64]) -> f64 {
    let mut result: f64 = 0.0;
    for i in 0..min(x.len(), y.len()) {
        result += x[i] * y[i];
    }
    return result;
}

fn dot_product_2(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(&a, &b)| a * b).sum::<f64>()
}

#[cfg(test)]
mod bench {
    use test::Bencher;
    use rand::{Rng,thread_rng};
    use super::*;

    const LEN: usize = 30;

    #[test]
    fn test_1() {
        let x = [1.0, 2.0, 3.0];
        let y = [2.0, 4.0, 6.0];
        let result = dot_product_1(&x, &y);
        assert_eq!(result, 28.0);
    }

    #[test]
    fn test_2() {
        let x = [1.0, 2.0, 3.0];
        let y = [2.0, 4.0, 6.0];
        let result = dot_product_2(&x, &y);
        assert_eq!(result, 28.0);
    }

    fn rand_array(cnt: u32) -> Vec<f64> {
        let mut rng = thread_rng();
        (0..cnt).map(|_| rng.gen::<f64>()).collect()

    }

    #[bench]
    fn bench_small_1(b: &mut Bencher) {
        let samples = rand_array(2*LEN as u32);
        b.iter(|| {
            dot_product_1(&samples[0..LEN], &samples[LEN..2*LEN])
        })
    }

    #[bench]
    fn bench_small_2(b: &mut Bencher) {
        let samples = rand_array(2*LEN as u32);
        b.iter(|| {
            dot_product_2(&samples[0..LEN], &samples[LEN..2*LEN])
        })
    }
}

The later of the links above claims that the version with the iterators should have similar performance "and actually be a little bit faster". However, when benchmarking the two, I get very different results:

running 2 tests
test bench::bench_small_loop ... bench:          20 ns/iter (+/- 1)
test bench::bench_small_iter ... bench:          24 ns/iter (+/- 2)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

So, where did the "zero-cost abstraction" go?

Update: Adding the foldr example provided by @wimh and using split_at instead of slices give the following result.

running 3 tests
test bench::bench_small_fold ... bench:          18 ns/iter (+/- 1)
test bench::bench_small_iter ... bench:          21 ns/iter (+/- 1)
test bench::bench_small_loop ... bench:          24 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured; 0 filtered out

So it seems that the additional time comes directly or indirectly from constructing the slices inside the measured code. To check that it indeed was the case, I tried the following two approaches with the same result (here shown for foldr case and using map + sum):

#[bench]
fn bench_small_iter(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let s0 = &samples[0..LEN];
    let s1 = &samples[LEN..2 * LEN];
    b.iter(|| dot_product_iter(s0, s1))
}

#[bench]
fn bench_small_fold(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    b.iter(|| dot_product_fold(s0, s1))
}
like image 778
Mats Kindahl Avatar asked Oct 20 '18 14:10

Mats Kindahl


People also ask

Are iterators faster than for loops?

Iterator and for-each loop are faster than simple for loop for collections with no random access, while in collections which allows random access there is no performance change with for-each loop/for loop/iterator.

Are iterators faster than for loops Rust?

Iterators. To determine whether to use loops or iterators, you need to know which implementation is faster: the version of the search function with an explicit for loop or the version with iterators. The iterator version was slightly faster!

Are Rust iterators slow?

The code only looks slow when un-commenting the for loop because it does not do anything otherwise. Iterators are lazy, and only perform some activity when consumed. A for loop is an example of a construct which consumes the iterator.

Are Rust iterators lazy?

In Rust, iterators are lazy, meaning they have no effect until you call methods that consume the iterator to use it up. For example, the code in Listing 13-10 creates an iterator over the items in the vector v1 by calling the iter method defined on Vec<T> . This code by itself doesn't do anything useful.


2 Answers

It seems zero cost to me. I wrote your code slightly more idiomatically, using the same random values for both tests, and then tested multiple times:

fn dot_product_1(x: &[f64], y: &[f64]) -> f64 {
    let mut result: f64 = 0.0;
    for i in 0..min(x.len(), y.len()) {
        result += x[i] * y[i];
    }
    result
}

fn dot_product_2(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(&a, &b)| a * b).sum()
}
fn rand_array(cnt: usize) -> Vec<f64> {
    let mut rng = rand::rngs::StdRng::seed_from_u64(42);
    rng.sample_iter(&rand::distributions::Standard).take(cnt).collect()
}

#[bench]
fn bench_small_1(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    b.iter(|| dot_product_1(s0, s1))
}

#[bench]
fn bench_small_2(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    b.iter(|| dot_product_2(s0, s1))
}
bench_small_1   20 ns/iter (+/- 6)
bench_small_2   17 ns/iter (+/- 1)

bench_small_1   19 ns/iter (+/- 3)
bench_small_2   17 ns/iter (+/- 2)

bench_small_1   19 ns/iter (+/- 5)
bench_small_2   17 ns/iter (+/- 3)

bench_small_1   19 ns/iter (+/- 3)
bench_small_2   24 ns/iter (+/- 7)

bench_small_1   28 ns/iter (+/- 1)
bench_small_2   24 ns/iter (+/- 1)

bench_small_1   27 ns/iter (+/- 1)
bench_small_2   25 ns/iter (+/- 1)

bench_small_1   28 ns/iter (+/- 1)
bench_small_2   25 ns/iter (+/- 1)

bench_small_1   28 ns/iter (+/- 1)
bench_small_2   25 ns/iter (+/- 1)

bench_small_1   28 ns/iter (+/- 0)
bench_small_2   25 ns/iter (+/- 1)

bench_small_1   28 ns/iter (+/- 1)
bench_small_2   17 ns/iter (+/- 1)

In 9 of the 10 runs, the idiomatic code was faster than the for loop. This was done on 2.9 GHz Core i9 (I9-8950HK) with 32 GB RAM, compiled with rustc 1.31.0-nightly (fc403ad98 2018-09-30).

like image 53
Shepmaster Avatar answered Sep 18 '22 22:09

Shepmaster


For fun, I rewrote the benchmark to use criterion, a port of the Haskell benchmarking library.

Cargo.toml

[package]
name = "mats-zero-cost-rust"
version = "0.1.0"
authors = ["mats"]

[dev-dependencies]
criterion = "0.2"
rand = "0.4"

[[bench]]
name = "my_benchmark"
harness = false

benches/my_benchmark.rs

#[macro_use]
extern crate criterion;
extern crate rand;

use std::cmp::min;

use criterion::Criterion;

use rand::{thread_rng, Rng};

const LEN: usize = 30;

fn dot_product_loop(x: &[f64], y: &[f64]) -> f64 {
    let mut result: f64 = 0.0;
    for i in 0..min(x.len(), y.len()) {
        result += x[i] * y[i];
    }
    return result;
}

fn dot_product_iter(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(&a, &b)| a * b).sum()
}

fn rand_array(cnt: u32) -> Vec<f64> {
    let mut rng = thread_rng();
    (0..cnt).map(|_| rng.gen()).collect()
}

fn criterion_loop_with_slice(c: &mut Criterion) {
    c.bench_function("loop with slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| dot_product_loop(&samples[0..LEN], &samples[LEN..2 * LEN]))
    });
}

fn criterion_loop_without_slice(c: &mut Criterion) {
    c.bench_function("loop without slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        let (s0, s1) = samples.split_at(LEN);
        b.iter(|| dot_product_loop(s0, s1))
    });
}

fn criterion_iter_with_slice(c: &mut Criterion) {
    c.bench_function("iterators with slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| dot_product_iter(&samples[0..LEN], &samples[LEN..2 * LEN]))
    });
}

fn criterion_iter_without_slice(c: &mut Criterion) {
    c.bench_function("iterators without slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        let (s0, s1) = samples.split_at(LEN);
        b.iter(|| dot_product_iter(s0, s1))
    });
}

criterion_group!(benches, criterion_loop_with_slice, criterion_loop_without_slice, criterion_iter_with_slice, criterion_iter_without_slice);
criterion_main!(benches);

I observe these results;

kolmodin@blin:~/code/mats-zero-cost-rust$ cargo bench
   Compiling mats-zero-cost-rust v0.1.0 (/home/kolmodin/code/mats-zero-cost-rust)                                                                                                                                  
    Finished release [optimized] target(s) in 1.16s                                                                                                                                                                
     Running target/release/deps/my_benchmark-6f00e042fd40bc1d
Gnuplot not found, disabling plotting
loop with slice         time:   [7.5794 ns 7.6843 ns 7.8016 ns]                             
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) high mild
  13 (13.00%) high severe

loop without slice      time:   [24.384 ns 24.486 ns 24.589 ns]                                
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild

iterators with slice    time:   [13.842 ns 13.852 ns 13.863 ns]                                  
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

iterators without slice time:   [13.420 ns 13.424 ns 13.430 ns]                                     
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  10 (10.00%) high severe

Gnuplot not found, disabling plotting

Using rustc 1.30.0 (da5f414c2 2018-10-24) on an AMD Ryzen 7 2700X.

The iterator implementation gets similar results for using slice and split_at.

The loop implementation gets very different results. The version with slice is significantly faster.

like image 30
L. Kolmodin Avatar answered Sep 20 '22 22:09

L. Kolmodin