Reading the "Zero-cost abstractions" post and looking at "Introduction to Rust: a low-level language with high-level abstractions", I tried to compare two approaches to computing the dot product of two vectors: one using a for loop and one using iterators.
#![feature(test)]
extern crate rand;
extern crate test;

use std::cmp::min;

fn dot_product_1(x: &[f64], y: &[f64]) -> f64 {
    let mut result: f64 = 0.0;
    for i in 0..min(x.len(), y.len()) {
        result += x[i] * y[i];
    }
    return result;
}

fn dot_product_2(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(&a, &b)| a * b).sum::<f64>()
}

#[cfg(test)]
mod bench {
    use test::Bencher;
    use rand::{Rng, thread_rng};
    use super::*;

    const LEN: usize = 30;

    #[test]
    fn test_1() {
        let x = [1.0, 2.0, 3.0];
        let y = [2.0, 4.0, 6.0];
        let result = dot_product_1(&x, &y);
        assert_eq!(result, 28.0);
    }

    #[test]
    fn test_2() {
        let x = [1.0, 2.0, 3.0];
        let y = [2.0, 4.0, 6.0];
        let result = dot_product_2(&x, &y);
        assert_eq!(result, 28.0);
    }

    fn rand_array(cnt: u32) -> Vec<f64> {
        let mut rng = thread_rng();
        (0..cnt).map(|_| rng.gen::<f64>()).collect()
    }

    #[bench]
    fn bench_small_1(b: &mut Bencher) {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| {
            dot_product_1(&samples[0..LEN], &samples[LEN..2 * LEN])
        })
    }

    #[bench]
    fn bench_small_2(b: &mut Bencher) {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| {
            dot_product_2(&samples[0..LEN], &samples[LEN..2 * LEN])
        })
    }
}
The latter of the two links claims that the version with the iterators should have similar performance "and actually be a little bit faster". However, when benchmarking the two, I get very different results:
running 2 tests
test bench::bench_small_loop ... bench: 20 ns/iter (+/- 1)
test bench::bench_small_iter ... bench: 24 ns/iter (+/- 2)
test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out
So, where did the "zero-cost abstraction" go?
Update: Adding the fold example provided by @wimh and using split_at instead of slices gives the following result:
running 3 tests
test bench::bench_small_fold ... bench: 18 ns/iter (+/- 1)
test bench::bench_small_iter ... bench: 21 ns/iter (+/- 1)
test bench::bench_small_loop ... bench: 24 ns/iter (+/- 1)
test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured; 0 filtered out
So it seems that the additional time comes, directly or indirectly, from constructing the slices inside the measured code. To check that this was indeed the case, I tried the following two approaches with the same result (shown here for the fold case and the map + sum case; a sketch of the fold variant follows the benchmarks below):
#[bench]
fn bench_small_iter(b: &mut Bencher) {
    let samples = rand_array(2 * LEN as u32);
    let s0 = &samples[0..LEN];
    let s1 = &samples[LEN..2 * LEN];
    b.iter(|| dot_product_iter(s0, s1))
}

#[bench]
fn bench_small_fold(b: &mut Bencher) {
    let samples = rand_array(2 * LEN as u32);
    let (s0, s1) = samples.split_at(LEN);
    b.iter(|| dot_product_fold(s0, s1))
}
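The fold-based dot product itself is not reproduced above; a minimal sketch of what dot_product_fold might look like (an assumption on my part; @wimh's exact code may differ):

// Hypothetical sketch of the fold-based variant referenced above.
fn dot_product_fold(x: &[f64], y: &[f64]) -> f64 {
    // Pairwise multiply and accumulate via fold instead of map + sum.
    x.iter().zip(y).fold(0.0, |acc, (&a, &b)| acc + a * b)
}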
It seems zero cost to me. I wrote your code slightly more idiomatically, using the same random values for both tests, and then tested multiple times:
use std::cmp::min;
// In addition to the question's imports, this version needs the SeedableRng
// trait for seed_from_u64 and the Rng trait for sample_iter.
use rand::{Rng, SeedableRng};

fn dot_product_1(x: &[f64], y: &[f64]) -> f64 {
    let mut result: f64 = 0.0;
    for i in 0..min(x.len(), y.len()) {
        result += x[i] * y[i];
    }
    result
}

fn dot_product_2(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(&a, &b)| a * b).sum()
}

fn rand_array(cnt: usize) -> Vec<f64> {
    // Fixed seed so both benchmarks see exactly the same input data.
    let mut rng = rand::rngs::StdRng::seed_from_u64(42);
    rng.sample_iter(&rand::distributions::Standard).take(cnt).collect()
}

#[bench]
fn bench_small_1(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    b.iter(|| dot_product_1(s0, s1))
}

#[bench]
fn bench_small_2(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    b.iter(|| dot_product_2(s0, s1))
}
bench_small_1 20 ns/iter (+/- 6)
bench_small_2 17 ns/iter (+/- 1)
bench_small_1 19 ns/iter (+/- 3)
bench_small_2 17 ns/iter (+/- 2)
bench_small_1 19 ns/iter (+/- 5)
bench_small_2 17 ns/iter (+/- 3)
bench_small_1 19 ns/iter (+/- 3)
bench_small_2 24 ns/iter (+/- 7)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 24 ns/iter (+/- 1)
bench_small_1 27 ns/iter (+/- 1)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 0)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 17 ns/iter (+/- 1)
In 9 of the 10 runs, the idiomatic code was faster than the for loop. This was done on a 2.9 GHz Core i9 (i9-8950HK) with 32 GB RAM, compiled with rustc 1.31.0-nightly (fc403ad98 2018-09-30).
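As an extra check (not part of the original answer), a small test could confirm that the two rewritten versions agree on the same seeded input; a sketch, assuming it sits in the same module as the code above:

#[test]
fn versions_agree_on_seeded_input() {
    // The fixed seed means both implementations see identical data,
    // so their results should match up to floating-point rounding.
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    assert!((dot_product_1(s0, s1) - dot_product_2(s0, s1)).abs() < 1e-12);
}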
For fun, I rewrote the benchmark to use criterion, a Rust port of the Haskell benchmarking library of the same name.
Cargo.toml
[package]
name = "mats-zero-cost-rust"
version = "0.1.0"
authors = ["mats"]

[dev-dependencies]
criterion = "0.2"
rand = "0.4"

[[bench]]
name = "my_benchmark"
harness = false
benches/my_benchmark.rs
#[macro_use]
extern crate criterion;
extern crate rand;

use std::cmp::min;

use criterion::Criterion;
use rand::{thread_rng, Rng};

const LEN: usize = 30;

fn dot_product_loop(x: &[f64], y: &[f64]) -> f64 {
    let mut result: f64 = 0.0;
    for i in 0..min(x.len(), y.len()) {
        result += x[i] * y[i];
    }
    return result;
}

fn dot_product_iter(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(&a, &b)| a * b).sum()
}

fn rand_array(cnt: u32) -> Vec<f64> {
    let mut rng = thread_rng();
    (0..cnt).map(|_| rng.gen()).collect()
}

fn criterion_loop_with_slice(c: &mut Criterion) {
    c.bench_function("loop with slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| dot_product_loop(&samples[0..LEN], &samples[LEN..2 * LEN]))
    });
}

fn criterion_loop_without_slice(c: &mut Criterion) {
    c.bench_function("loop without slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        let (s0, s1) = samples.split_at(LEN);
        b.iter(|| dot_product_loop(s0, s1))
    });
}

fn criterion_iter_with_slice(c: &mut Criterion) {
    c.bench_function("iterators with slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| dot_product_iter(&samples[0..LEN], &samples[LEN..2 * LEN]))
    });
}

fn criterion_iter_without_slice(c: &mut Criterion) {
    c.bench_function("iterators without slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        let (s0, s1) = samples.split_at(LEN);
        b.iter(|| dot_product_iter(s0, s1))
    });
}

criterion_group!(
    benches,
    criterion_loop_with_slice,
    criterion_loop_without_slice,
    criterion_iter_with_slice,
    criterion_iter_without_slice
);
criterion_main!(benches);
I observe these results:
kolmodin@blin:~/code/mats-zero-cost-rust$ cargo bench
   Compiling mats-zero-cost-rust v0.1.0 (/home/kolmodin/code/mats-zero-cost-rust)
    Finished release [optimized] target(s) in 1.16s
     Running target/release/deps/my_benchmark-6f00e042fd40bc1d
Gnuplot not found, disabling plotting

loop with slice         time:   [7.5794 ns 7.6843 ns 7.8016 ns]
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) high mild
  13 (13.00%) high severe

loop without slice      time:   [24.384 ns 24.486 ns 24.589 ns]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild

iterators with slice    time:   [13.842 ns 13.852 ns 13.863 ns]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

iterators without slice time:   [13.420 ns 13.424 ns 13.430 ns]
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  10 (10.00%) high severe

Gnuplot not found, disabling plotting
Using rustc 1.30.0 (da5f414c2 2018-10-24) on an AMD Ryzen 7 2700X.

The iterator implementation gets similar results whether using slices or split_at.
The loop implementation gets very different results. The version with slice is significantly faster.
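One way to probe that difference (not part of the original answer) would be to hide the input from the optimizer with criterion::black_box before slicing, so the benchmark cannot be specialized on the statically known bounds; a sketch, which would also need to be added to the criterion_group! call above:

use criterion::black_box;

// Hypothetical extra benchmark: same shape as "loop with slice", but the
// input is routed through black_box before the slices are taken.
fn criterion_loop_with_slice_black_box(c: &mut Criterion) {
    c.bench_function("loop with slice, black_box input", |b| {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| {
            let s = black_box(&samples[..]);
            dot_product_loop(&s[0..LEN], &s[LEN..2 * LEN])
        })
    });
}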