I've made some tests on functional style programming performance in Rust:
extern crate rand; // 0.5.5
use rand::Rng;
fn time(f: impl FnOnce()) -> std::time::Duration {
let s = std::time::Instant::now();
f();
s.elapsed()
}
fn main() {
let iteration = 10000000;
let mut rng = rand::thread_rng();
println!(
"while: {:?}",
time(|| {
let mut i = 0;
let mut num = 0i64;
while i < iteration {
num += rng.gen::<i64>();
i += 1;
}
})
); // 29.116528ms
println!(
"for: {:?}",
time(|| {
let mut num = 0i64;
for _ in 0..iteration {
num += rng.gen::<i64>();
}
})
); // 26.68407ms
println!(
"fold: {:?}",
time(|| {
rng.gen_iter::<i64>().take(iteration).fold(0, |x, y| x + y);
})
); // 26.065936ms
}
I've set the optimization flag to compile it.
These three cases took nearly the same time, does that mean that functional programming in Rust is zero-cost?
Standard performance caveat Like always, you should benchmark your code in your situations and understand what the tradeoffs are. Start with understandable code and make it faster when/if necessary.
Here are the functions, broken out and made to never inline. I also prevented the random number generator from being inlined and made the iteration count smaller for later:
extern crate rand; // 0.5.5
use rand::{distributions::Standard, Rng, RngCore};
const ITERATION: usize = 10000;
#[inline(never)]
fn add_manual(mut rng: impl Rng) -> i64 {
let mut num = 0;
let mut i = 0;
while i < ITERATION {
num += rng.gen::<i64>();
i += 1;
}
num
}
#[inline(never)]
fn add_range(mut rng: impl Rng) -> i64 {
let mut num = 0;
for _ in 0..ITERATION {
num += rng.gen::<i64>();
}
num
}
#[inline(never)]
fn add_fold(mut rng: impl Rng) -> i64 {
rng.sample_iter::<i64, _>(&Standard)
.take(ITERATION)
.fold(0i64, |x, y| x + y)
}
#[inline(never)]
fn add_sum(mut rng: impl Rng) -> i64 {
rng.sample_iter::<i64, _>(&Standard).take(ITERATION).sum()
}
// Prevent inlining the RNG to create easier-to-inspect LLVM IR
struct NoInlineRng<R: Rng>(R);
impl<R: Rng> RngCore for NoInlineRng<R> {
#[inline(never)]
fn next_u32(&mut self) -> u32 {
self.0.next_u32()
}
#[inline(never)]
fn next_u64(&mut self) -> u64 {
self.0.next_u64()
}
#[inline(never)]
fn fill_bytes(&mut self, dest: &mut [u8]) {
self.0.fill_bytes(dest)
}
#[inline(never)]
fn try_fill_bytes(&mut self, dest: &mut [u8]) -> Result<(), rand::Error> {
self.0.try_fill_bytes(dest)
}
}
fn main() {
let mut rng = NoInlineRng(rand::thread_rng());
let a = add_manual(&mut rng);
let b = add_range(&mut rng);
let c = add_fold(&mut rng);
let d = add_sum(&mut rng);
println!("{}, {}, {}, {}", a, b, c, d);
}
And the corresponding LLVM IR, from Rust 1.29.2 building in release mode:
; playground::add_manual
; Function Attrs: noinline uwtable
define internal fastcc i64 @_ZN10playground10add_manual17hb7f61676b41e00bfE(i64** dereferenceable(8)) unnamed_addr #4 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
br label %bb4
bb3: ; preds = %bb4
ret i64 %2
bb4: ; preds = %bb4, %start
%num.09 = phi i64 [ 0, %start ], [ %2, %bb4 ]
%i.08 = phi i64 [ 0, %start ], [ %3, %bb4 ]
%rng.val.val = load i64*, i64** %0, align 8
; call <playground::NoInlineRng<R> as rand_core::RngCore>::next_u64
%1 = tail call fastcc i64 @"_ZN71_$LT$playground..NoInlineRng$LT$R$GT$$u20$as$u20$rand_core..RngCore$GT$8next_u6417h0b95e10cc642939aE"(i64* %rng.val.val)
%2 = add i64 %1, %num.09
%3 = add nuw nsw i64 %i.08, 1
%exitcond = icmp eq i64 %3, 10000
br i1 %exitcond, label %bb3, label %bb4
}
; playground::add_range
; Function Attrs: noinline uwtable
define internal fastcc i64 @_ZN10playground9add_range17h27ceded9d02ff747E(i64** dereferenceable(8)) unnamed_addr #4 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
br label %bb8
bb6: ; preds = %bb8
ret i64 %3
bb8: ; preds = %bb8, %start
%num.021 = phi i64 [ 0, %start ], [ %3, %bb8 ]
%iter.sroa.0.020 = phi i64 [ 0, %start ], [ %1, %bb8 ]
%1 = add nuw nsw i64 %iter.sroa.0.020, 1
%rng.val.val = load i64*, i64** %0, align 8
; call <playground::NoInlineRng<R> as rand_core::RngCore>::next_u64
%2 = tail call fastcc i64 @"_ZN71_$LT$playground..NoInlineRng$LT$R$GT$$u20$as$u20$rand_core..RngCore$GT$8next_u6417h0b95e10cc642939aE"(i64* %rng.val.val)
%3 = add i64 %2, %num.021
%exitcond = icmp eq i64 %1, 10000
br i1 %exitcond, label %bb6, label %bb8
}
; playground::add_sum
; Function Attrs: noinline uwtable
define internal fastcc i64 @_ZN10playground7add_sum17h0910bf39c2bf0430E(i64** dereferenceable(8)) unnamed_addr #4 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
bb2.i.i.i.i:
br label %bb2.i.i.i.i.i
bb2.i.i.i.i.i: ; preds = %bb2.i.i.i.i.i, %bb2.i.i.i.i
%1 = phi i64 [ 10000, %bb2.i.i.i.i ], [ %3, %bb2.i.i.i.i.i ]
%accum.0.i.i.i.i.i = phi i64 [ 0, %bb2.i.i.i.i ], [ %4, %bb2.i.i.i.i.i ]
%.val.val.i.i.i.i.i.i = load i64*, i64** %0, align 8, !noalias !33
; call <playground::NoInlineRng<R> as rand_core::RngCore>::next_u64
%2 = tail call fastcc i64 @"_ZN71_$LT$playground..NoInlineRng$LT$R$GT$$u20$as$u20$rand_core..RngCore$GT$8next_u6417h0b95e10cc642939aE"(i64* %.val.val.i.i.i.i.i.i), !noalias !33
%3 = add nsw i64 %1, -1
%4 = add i64 %2, %accum.0.i.i.i.i.i
%5 = icmp eq i64 %3, 0
br i1 %5, label %_ZN4core4iter8iterator8Iterator3sum17hcbc4a00f32ac1feeE.exit, label %bb2.i.i.i.i.i
_ZN4core4iter8iterator8Iterator3sum17hcbc4a00f32ac1feeE.exit: ; preds = %bb2.i.i.i.i.i
ret i64 %4
}
You can see that add_manual
and add_range
are basically the same, except for the position of add
. add_sum
is also similar, but it counts down from 10000 instead of counting up. There is no definition for add_fold
because the compiler has identified that it's the exact same code as add_sum
and combined them.
In this case, the optimizer can indeed make these basically the same. Let's use the built-in benchmarking:
#[bench]
fn bench_add_manual(b: &mut Bencher) {
b.iter(|| {
let rng = rand::thread_rng();
add_manual(rng)
});
}
#[bench]
fn bench_add_range(b: &mut Bencher) {
b.iter(|| {
let rng = rand::thread_rng();
add_range(rng)
});
}
#[bench]
fn bench_add_sum(b: &mut Bencher) {
b.iter(|| {
let rng = rand::thread_rng();
add_sum(rng)
});
}
The results are:
test bench_add_manual ... bench: 28,058 ns/iter (+/- 3,552)
test bench_add_range ... bench: 28,349 ns/iter (+/- 6,663)
test bench_add_sum ... bench: 29,807 ns/iter (+/- 2,016)
This seems pretty much the same to me. I would say, in this case, at this point in time, that there isn't a significant difference in the performance. However, this doesn't apply to every possible bit of code in a functional style.
Typically, fold (reduce) can compile to the equivalent efficient hand-compiled code, and so save programmer time. Notably, recursion in a fold is in tail position, so it is merely a simpler way to write a loop.
This will not be true of all programs you write in functional style.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With