I have a collection of std::set. I want to find the intersection of all the sets in this collection in the fastest manner. The number of sets in the collection is typically very small (~5-10), and the number of elements in each set is usually less than 1000, but can occasionally go up to around 10000. But I need to do these intersections tens of thousands of times, as fast as possible. I tried to benchmark a few methods as follows:

1. In-place, in a std::set object which initially copies the first set. Then, for subsequent sets, it iterates over all elements of itself and the i-th set of the collection, and removes items from itself as needed.
2. Using std::set_intersection into a temporary std::set, swapping its contents into a current set, then again finding the intersection of the current set with the next set and inserting into the temp set, and so on.
3. Using a vector as the destination container instead of std::set.
4. Using a std::list instead of a vector, suspecting a list will provide faster deletions from the middle.
5. Using hash sets (std::unordered_set) and checking for all items in all sets.

As it turned out, using a vector is marginally faster when the number of elements in each set is small, and a list is marginally faster for larger sets. In-place using a set is substantially slower than both, followed by set_intersection and hash sets. Is there a faster algorithm, data structure, or trick to achieve this? I can post code snippets if required. Thanks!
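To make the first approach concrete, here is a minimal sketch of what an in-place intersection in a std::set might look like. It is an illustration only, not the code that was actually benchmarked: the name intersect_in_place and the int element type are assumptions.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper (assumed name and element type).
// Copies the first set, then erases every element missing from a later set.
std::set<int> intersect_in_place(const std::vector<std::set<int>>& sets) {
    if (sets.empty()) { return {}; }
    std::set<int> result = sets.front();
    for (std::size_t i = 1; i < sets.size() && !result.empty(); ++i) {
        auto it = result.begin();
        auto jt = sets[i].begin();
        // Walk both sorted sequences in lockstep.
        while (it != result.end() && jt != sets[i].end()) {
            if (*it < *jt) {
                it = result.erase(it);  // *it is not in sets[i]
            } else if (*jt < *it) {
                ++jt;                   // *jt is not in result
            } else {
                ++it;                   // common element, keep it
                ++jt;
            }
        }
        // Anything left in result is larger than every element of sets[i].
        result.erase(it, result.end());
    }
    return result;
}
```

Every erase frees a tree node, which is one plausible reason this variant benchmarks slower than writing into a contiguous container.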
The intersection of sets A and B is the set of all elements which are common to both A and B. For example, if A = {2, 4, 6, 8} and B = {4, 8, 12}, the elements common to A and B are 4 and 8.
The C++ algorithm std::set_intersection() finds the intersection of two sorted ranges [first1, last1) and [first2, last2), which is formed only by the elements that are present in both ranges.
Inserting a new element into a std::set takes O(log N). A vector is faster for insertion and deletion at the end of the container. A set is faster for insertion and deletion in the middle of the container, where a vector has to shift all subsequent elements.
std::set_intersection in C++: the intersection of two sets is formed only by the elements that are present in both sets. The elements copied by the function always come from the first range, in the same order. The elements in both ranges must already be sorted.
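As a small illustration of std::set_intersection on two sorted ranges, here is a sketch that writes the common elements into a vector. The helper name intersect_sorted is hypothetical, and the int element type is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical helper: intersects two vectors that are already sorted ascending.
std::vector<int> intersect_sorted(const std::vector<int>& a,
                                  const std::vector<int>& b) {
    // std::set_intersection requires sorted input; check it in debug builds.
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));

    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(),
                          b.begin(), b.end(),
                          std::back_inserter(out));
    return out;  // elements are copied from the first range, in order
}
```

Calling intersect_sorted({2, 4, 6, 8}, {4, 8, 12}) yields a vector holding 4 and 8. The same call works on std::set iterators, since a std::set is always iterated in sorted order.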
You might want to try a generalization of std::set_intersection(): the algorithm is to use iterators for all sets:

1. If any iterator has reached the end() of its corresponding set, you are done. Thus, it can be assumed that all iterators are valid.
2. Take the first iterator's value as the next candidate value x.
3. Move through the list of iterators and std::find_if() the first element at least as big as x.
4. If the value is bigger than x, make it the new candidate value and search again in the sequence of iterators.
5. If all iterators are on value x, you found an element of the intersection: record it, increment all iterators, start over.

Night is a good adviser and I think I may have an idea ;)
This is why, where speed matters, a vector (or perhaps a deque) is such a great structure: it plays very well with memory. As such, I would definitely recommend using a vector as our intermediary structure, although care needs to be taken to only ever insert or delete at an extremity, to avoid relocating elements.
So I thought about a rather simple approach:
    #include <cassert>
    #include <cstddef>
    #include <algorithm>
    #include <iterator>   // std::back_inserter
    #include <set>
    #include <vector>

    // Do not call this method if you have a single set...
    // And the pointers better not be null either!
    std::vector<int> intersect(std::vector< std::set<int> const* > const& sets) {
        for (auto s: sets) { assert(s && "I said no null pointer"); }

        std::vector<int> result; // only return this one, for NRVO to kick in

        // 0. Check obvious cases
        if (sets.empty()) { return result; }

        if (sets.size() == 1) {
            result.assign(sets.front()->begin(), sets.front()->end());
            return result;
        }

        // 1. Merge first two sets in the result
        std::set_intersection(sets[0]->begin(), sets[0]->end(),
                              sets[1]->begin(), sets[1]->end(),
                              std::back_inserter(result));

        if (sets.size() == 2) { return result; }

        // 2. Merge consecutive sets with result into buffer, then swap them around
        //    so that the "result" is always in result at the end of the loop.
        std::vector<int> buffer; // outside the loop so that we reuse its memory

        for (std::size_t i = 2; i < sets.size(); ++i) {
            buffer.clear();

            std::set_intersection(result.begin(), result.end(),
                                  sets[i]->begin(), sets[i]->end(),
                                  std::back_inserter(buffer));

            result.swap(buffer); // constant time, no element is copied
        }

        return result;
    }
It seems correct, I cannot guarantee its speed though, obviously.
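For comparison, the iterator-based generalization of std::set_intersection() suggested in the other answer could be sketched roughly as follows. This is one possible reading of those steps, not a tuned implementation: the name intersect_all, the int element type, and passing the sets by value in a vector are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper: walks one iterator per set in lockstep.
std::vector<int> intersect_all(const std::vector<std::set<int>>& sets) {
    std::vector<int> result;
    if (sets.empty()) { return result; }

    // One cursor per set; an empty set makes the intersection empty.
    std::vector<std::set<int>::const_iterator> its;
    its.reserve(sets.size());
    for (const auto& s : sets) {
        if (s.empty()) { return result; }
        its.push_back(s.begin());
    }

    while (true) {
        int x = *its[0];  // candidate value taken from the first cursor
        bool all_equal = false;
        while (!all_equal) {
            all_equal = true;
            for (std::size_t i = 0; i < sets.size(); ++i) {
                // Advance cursor i to the first element at least as big as x.
                its[i] = std::find_if(its[i], sets[i].end(),
                                      [x](int v) { return !(v < x); });
                if (its[i] == sets[i].end()) { return result; }
                if (x < *its[i]) {  // bigger value: it becomes the candidate
                    x = *its[i];
                    all_equal = false;
                }
            }
        }
        result.push_back(x);  // every cursor sits on x
        for (std::size_t i = 0; i < sets.size(); ++i) {
            ++its[i];
            if (its[i] == sets[i].end()) { return result; }
        }
    }
}
```

The linear std::find_if keeps the sketch close to the description. Whether it beats the pairwise std::set_intersection version for 5-10 sets of about 1000 elements is something only a benchmark on the real data can tell.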