I have many 4KiB buffers, which have a 50% chance of containing only zero values. The non-zero buffers will typically have a non-zero byte early in the buffer.
fn is_zero(buf: &Vec<u8>) -> bool {
    for byte in buf.into_iter() {
        if *byte != 0 {
            return false;
        }
    }
    return true;
}
Is this a performant way of checking in Rust, compiled with --release? (I am processing many GBs of data.)
(In the C version, I cast the buffer to unsigned long long before checking. That probably wasn't the best I could do, given SSE etc.)
You can use align_to to convert the slice of u8 into a slice of u128, making the comparison more efficient:
fn is_zero(buf: &[u8]) -> bool {
    let (prefix, aligned, suffix) = unsafe { buf.align_to::<u128>() };
    prefix.iter().all(|&x| x == 0)
        && suffix.iter().all(|&x| x == 0)
        && aligned.iter().all(|&x| x == 0)
}
Running a simple benchmark on my machine shows a 16x performance gain!
#![feature(test)]
extern crate test;

fn v() -> Vec<u8> {
    std::iter::repeat(0).take(1000000).collect()
}

fn is_zero(buf: &[u8]) -> bool {
    buf.into_iter().all(|&b| b == 0)
}

fn is_zero_aligned(buf: &[u8]) -> bool {
    let (prefix, aligned, suffix) = unsafe { buf.align_to::<u128>() };
    prefix.iter().all(|&x| x == 0)
        && suffix.iter().all(|&x| x == 0)
        && aligned.iter().all(|&x| x == 0)
}

#[bench]
fn bench_is_zero(b: &mut test::Bencher) {
    let v = test::black_box(v());
    b.iter(|| is_zero(&v[..]))
}

#[bench]
fn bench_is_zero_aligned(b: &mut test::Bencher) {
    let v = test::black_box(v());
    b.iter(|| is_zero_aligned(&v[..]))
}
running 2 tests
test tests::bench_is_zero ... bench: 455,975 ns/iter (+/- 414)
test tests::bench_is_zero_aligned ... bench: 28,615 ns/iter (+/- 116)
Depending on your machine, different integer types (such as u64) may yield better performance.
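For instance, a u64 variant is a one-line change from the function above (a sketch; is_zero_u64 is my name for it, and the same align_to considerations apply):

```rust
fn is_zero_u64(buf: &[u8]) -> bool {
    // Sound in practice: every bit pattern is a valid u64, so reinterpreting
    // the byte slice cannot produce an invalid value.
    let (prefix, aligned, suffix) = unsafe { buf.align_to::<u64>() };
    prefix.iter().all(|&x| x == 0)
        && aligned.iter().all(|&x| x == 0)
        && suffix.iter().all(|&x| x == 0)
}
```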
Thanks to @Globi on the Rust Discord server for the idea.
Found a 4x speedup on my laptop with byteorder, by reading u64 at a time, in native endian.
extern crate byteorder;

use byteorder::{NativeEndian, ReadBytesExt};
use std::io::Cursor;

pub fn one(buf: &[u8]) -> bool {
    buf.into_iter().all(|&byte| byte == 0)
}

pub fn two(buf: &[u8]) -> bool {
    let mut cur = Cursor::new(buf);
    // Read whole u64 words until fewer than 8 bytes remain...
    while let Ok(val) = cur.read_u64::<NativeEndian>() {
        if val != 0 {
            return false;
        }
    }
    // ...then check the leftover tail byte by byte.
    while let Ok(val) = cur.read_u8() {
        if val != 0 {
            return false;
        }
    }
    true
}
#![feature(test)]
extern crate test;
extern crate zero_slice_8;

use zero_slice_8::{one, two};

fn v() -> Vec<u8> {
    let mut result = vec![];
    for _ in 0..100000 {
        result.push(0);
    }
    result
}

#[bench]
fn bench_one(b: &mut test::Bencher) {
    let v = v();
    b.iter(|| one(&v[..]))
}

#[bench]
fn bench_two(b: &mut test::Bencher) {
    let v = v();
    b.iter(|| two(&v[..]))
}
The following function is pure, safe Rust:
use std::convert::TryInto; // needed before the 2021 edition; now in the prelude

fn is_zero(slice: &[u8]) -> bool {
    for i in (0..slice.len()).step_by(16) {
        if slice.len() - i >= 16 {
            // Check a full 16-byte chunk as a single u128.
            let arr: [u8; 16] = slice[i..i + 16].try_into().expect("this should always succeed");
            if u128::from_be_bytes(arr) != 0 {
                return false;
            }
        } else {
            // Check the tail (fewer than 16 bytes) one byte at a time.
            for j in i..slice.len() {
                if slice[j] != 0 {
                    return false;
                }
            }
        }
    }
    return true;
}
Specifically, it uses the u128::from_be_bytes function to convert a [u8; 16] array to a u128, and uses the TryInto trait to turn a [u8] of appropriate length into a [u8; 16]; the rest is fairly trivial. Since the value is only compared against zero, the byte order is irrelevant (from_ne_bytes would avoid a byte swap on little-endian machines). It is possible to manually unroll the inner loop that handles the tail bytes left over when the length isn't a multiple of 16, but I doubt that would be a significant performance win.
Depending on the processor, using u64 or even u32 might be faster; one would have to profile that for oneself.
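Those narrower widths can be tried without unsafe code via chunks_exact (a sketch, assuming unaligned 8-byte loads are cheap on the target; is_zero_u64 is a hypothetical name):

```rust
use std::convert::TryInto; // in the prelude since the 2021 edition

fn is_zero_u64(slice: &[u8]) -> bool {
    let mut chunks = slice.chunks_exact(8);
    // Check the bulk of the slice eight bytes at a time...
    let body_zero = chunks
        .by_ref()
        .all(|c| u64::from_ne_bytes(c.try_into().unwrap()) == 0);
    // ...then the leftover tail (fewer than 8 bytes) byte by byte.
    body_zero && chunks.remainder().iter().all(|&b| b == 0)
}
```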