I need to find out how many integers are present in both of two given sets, fast. The sets are written to only once but this operation will be performed many times with different pairs of sets. The sets contain 5-30 integers and the largest of these integers is 840000.
I have initially tried to iterate over one Vec
and for each element check if its present in the other Vec
. I then decided to use BTreeSet
instead since it should be significantly faster at checking if an integer is present in the set, but that does not seem to be the case. The Vec
implementation takes ~72ms and the BTreeSet ~96ms when called on couple thousands of sets in release mode under stable Rust 1.34 with same performance when using nightly.
This is the Vec
implementation:
use std::cmp;
fn main() {
let mut sets = Vec::with_capacity(1000);
for i in 1..1000 {
let mut set = Vec::new();
for j in 1..i % 30 {
set.push(i * j % 50000);
}
sets.push(set);
}
for left_set in sets.iter() {
for right_set in sets.iter() {
calculate_waste(left_set, right_set);
}
}
}
fn calculate_waste(left_nums: &Vec<usize>, right_nums: &Vec<usize>) -> usize {
let common_nums = left_nums.iter().fold(0, |intersection_count, num| {
intersection_count + right_nums.contains(num) as usize
});
let left_side = left_nums.len() - common_nums;
let right_side = right_nums.len() - common_nums;
let score = cmp::min(common_nums, cmp::min(left_side, right_side));
left_side - score + right_side - score + common_nums - score
}
And this is the BTreeSet
implementation:
use std::cmp;
use std::collections::BTreeSet;
fn main() {
let mut sets = Vec::with_capacity(1000);
for i in 1..1000 {
let mut set = BTreeSet::new();
for j in 1..i % 30 {
set.insert(i * j % 50000);
}
sets.push(set);
}
for left_set in sets.iter() {
for right_set in sets.iter() {
calculate_waste(left_set, right_set);
}
}
}
fn calculate_waste(left_nums: &BTreeSet<usize>, right_nums: &BTreeSet<usize>) -> usize {
let common_nums = left_nums.intersection(&right_nums).count();
let left_side = left_nums.len() - common_nums;
let right_side = right_nums.len() - common_nums;
let score = cmp::min(common_nums, cmp::min(left_side, right_side));
left_side - score + right_side - score + common_nums - score
}
It was ran with the command (-w 50
makes it ignore the first 50 runs):
hyperfine "cargo run --release" -w 50 -m 100
Full code of the program available here.
Is the BTreeSet
implementation slower because there are too few integers in the set to allow its O(log n) access time to shine? If so, is there anything else I can do to speed up this function?
Since your sets don't change over time, I think your best option is to use sorted vectors. Sorting the vectors will be required only once, at initialization time. The intersection of two sorted vectors can be computed in linear time by iterating over them simultaneously, always advancing the iterator that currently points to the lower number. Here is an attempt at an implementation:
fn intersection_count_sorted_vec(a: &[u32], b: &[u32]) -> usize {
let mut count = 0;
let mut b_iter = b.iter();
if let Some(mut current_b) = b_iter.next() {
for current_a in a {
while current_b < current_a {
current_b = match b_iter.next() {
Some(current_b) => current_b,
None => return count,
};
}
if current_a == current_b {
count += 1;
}
}
}
count
}
This probably isn't particularly well optimised; regardless, benchmarking with Criterion-based code indicates this version is more than three times as fast as your solution using vectors.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With