Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient, or fast, size of the set intersection of two vectors

I find myself needing to return the size of the intersection of two vectors:

std::vector<int> A_, B_

I do not require the intersected values, just the size of the set. This function needs to be called a very large number of times. This is part of a much bigger simulation over a (mathematical) graph/network.

My working conditions are:

  • Containers are vectors. To change them is pure pain, but would certainly do so if the gain warrants it.
  • The size of A_ and B_ have an upper bound of ~100. But are often much smaller.
  • Elements of A_ and B_ represent samples taken from {1,2,...,M}, where M >10,000.
  • In general, A_ and B_ have similar, but unequal, sizes.
  • Both vectors are unordered.
  • The contents of A_ and B_ change, as part of the "bigger simulation".
  • Each vector contains only unique elements i.e. no repeats.

My first attempt, using a naive loop, is below. But I think this may not be enough. I've assumed...that std::set_intersection will be too onerous due to repeated sorts and allocations.

   int vec_intersect(const std::vector<int>& A_, const std::vector<int>& B_) {

      int c_count=0;

  for(std::vector<int>::const_iterator it = A_.begin(); it != A_.end(); ++it){
     for(std::vector<int>::const_iterator itb = B_.begin(); itb != B_.end(); ++itb){

      if(*it==*itb) ++c_count;
     }
  }

  return c_count;
}

Given my conditions above, how else can I implement this to gain speed, relatively easily? Should I be thinking about hash tables or going with sorts and STL, or different containers?

like image 487
Rusan Kax Avatar asked Jun 21 '14 01:06

Rusan Kax


2 Answers

Your algorithm is O(n2) in the number of elements (assuming that the size of both vectors is approximately equal to n). Here is an O(n) algorithm:

  • Create an std::unordered_set<int>
  • Put all items of vector A into the set
  • Go through all items of vector B, checking that they are present in the unordered_set, and incrementing the count for each item that is present.
  • Return the final count.

Here is an implementation in C++11, using a lambda for brevity:

vector<int> a {2, 3, 5, 7, 11, 13};
vector<int> b {1, 3, 5, 7, 9, 11};
unordered_set<int> s(a.begin(), a.end());
int res = count_if(b.begin(), b.end(), [&](int k) {return s.find(k) != s.end();});
// Lambda above captures the set by reference. count_if passes each element of b
// to the lambda. The lambda returns true if there is a match, and false otherwise.

(this prints 4; demo)

like image 143
Sergey Kalinichenko Avatar answered Nov 13 '22 12:11

Sergey Kalinichenko


Your algorithm is O(n*m), where n and m are the number of elements in the vectors.

If you don't have issues where the input data is untrusted, you'll probably have the best results with:

  • Place all the elements of A into an unordered_set
  • For each element in B, if it is in the set, increment your counter.

For example:

int vec_intersect(const std::vector<int>& A_, const std::vector<int>& B_)
{
    std::unordered_set<int> aSet(A_.cbegin(), A_.cend());
    return std::count_if(B_.cbegin(), B_.cend(), [&](int element) {
        return aSet.find(element) != aSet.end();
        });
}

This will probabilistically give O(m + n) results. (Hash tables are almost always O(1), but if an attacker can force many collisions in the table they could force O(n) behavior, leading to denial of service)

If you require deterministic results, and the order of the vectors does not matter, sorting one vector will work, which is only O(m lg m + m + n). That is:

  • Sort the first vector
  • For each element in the second vector, use binary search to determine if the element is in the first vector, and if so, increment your counter.

For example:

int vec_intersect(std::vector<int>& A_, const std::vector<int>& B_)
{
    std::sort(A_.begin(), A_.end());
    return std::count_if(B_.cbegin(), B_.cend(), [&](int element) {
        return std::binary_search(A_.cbegin(), A_.cend(), element);
        });
}

Just for giggles, here's an <algorithm>-ized version of your algorithm:

int vec_intersect(const std::vector<int>& A_, const std::vector<int>& B_)
{
    return std::count_if(B_.cbegin(), B_.cend(), [&](int element) {
        return std::find(A_.cbegin(), A_.cend(), element) != A_.cend();
        });
}
like image 3
Billy ONeal Avatar answered Nov 13 '22 10:11

Billy ONeal