Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

performing vector intersection in C++

I have a vector of vector of unsigned. I need to find the intersection of all these vector of unsigned's for doing so I wrote the following code:

int func()
{
   vector<vector<unsigned> > t;
   vector<unsigned> intersectedValues;
   bool firstIntersection=true;
   for(int i=0;i<(t).size();i++)
   {
       if(firstIntersection)
       {
           intersectedValues=t[0];
           firstIntersection=false;
       }else{
           vector<unsigned> tempIntersectedSubjects;                                                              
           set_intersection(t[i].begin(),
                  t[i].end(), intersectedValues.begin(),
                  intersectedValues.end(),
                  std::inserter(tempIntersectedSubjects, tempIntersectedSubjects.begin()));
           intersectedValues=tempIntersectedSubjects;
       }         
       if(intersectedValues.size()==0)
           break;
   }               
}

Each individual vector has 9000 elements and there are many such vectors in "t". When I profiled my code I found that set_intersection takes the maximum amount of time and hence makes the code slow when there are many invocations of func(). Can someone please suggest as to how can I make the code more efficient.

I am using: gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)

EDIT: Individual vectors in vector "t" are sorted.

like image 331
Steg Verner Avatar asked May 09 '15 14:05

Steg Verner


2 Answers

I don't have a framework to profile the operations but I'd certainly change the code to reuse the readily allocated vector. In addition, I'd hoist the initial intersection out of the loop. Also, std::back_inserter() should make sure that elements are added in the correct location rather than in the beginning:

int func()
{
    vector<vector<unsigned> > t = some_initialization();
    if (t.empty()) {
        return;
    }
    vector<unsigned> intersectedValues(t[0]);
    vector<unsigned> tempIntersectedSubjects;
    for (std::vector<std::vector<unsigned>>::size_type i(1u);
         i < t.size() && !intersectedValues.empty(); ++i) {
        std::set_intersection(t[i].begin(), t[i].end(),
                              intersectedValues.begin(), intersectedValues.end(),
                             std::back_inserter(tempIntersectedSubjects);
        std::swap(intersectedValues, tempIntersectedSubjects);
        tempIntersectedSubjects.clear();
    }
}               

I think this code has a fair chance to be faster. It may also be reasonable to intersect the sets different: instead of keeping one set and intersecting with that you could create a new intersection for pairs of adjacent sets and then intersect the first sets with their respect adjacent ones:

std::vector<std::vector<unsigned>> intersections(
    std::vector<std::vector<unsigned>> const& t) {
    std::vector<std::vector<unsigned>> r;
    std::vector<std::vector<unsignned>>::size_type i(0);
    for (; i + 1 < t.size(); i += 2) {
        r.push_back(intersect(t[i], t[i + 1]));
    }
    if (i < t.size()) {
        r.push_back(t[i]);
    }
    return r;
}

std::vector<unsigned> func(std::vector<std::vector<unsigned>> const& t) {
    if (t.empty()) { /* deal with t being empty... */ }
    std::vector<std::vector<unsigned>> r(intersections(t))
    return r.size() == 1? r[0]: func(r);
}

Of course, you wouldn't really implement it like this: you'd use Stepanov's binary counter to keep the intermediate sets. This approach assumes that the result is most likely non-empty. If the expectation is that the result will be empty that may not be an improvement.

like image 65
Dietmar Kühl Avatar answered Oct 02 '22 20:10

Dietmar Kühl


I can't test this but maybe something like this would be faster?

int func()
{
   vector<vector<unsigned> > t;
   vector<unsigned> intersectedValues;

   // remove if() branching from loop

   if(t.empty())
       return -1;

   intersectedValues = t[0];

   // now start from 1
   for(size_t i = 1; i < t.size(); ++i)
   {
       vector<unsigned> tempIntersectedSubjects;
       tempIntersectedSubjects.reserve(intersectedValues.size()); // pre-allocate

       // insert at end() not begin()
       set_intersection(t[i].begin(),
              t[i].end(), intersectedValues.begin(),
              intersectedValues.end(),
              std::inserter(tempIntersectedSubjects, tempIntersectedSubjects.end()));

       // as these are not used again you can move them rather than copy
       intersectedValues = std::move(tempIntersectedSubjects);

       if(intersectedValues.empty())
           break;
   }
   return 0;
}

Another possibility:

Thinking about it using swap() could optimize the exchange of data and remove the need to re-allocate. Also then the temp constructor can be moved out of the loop.

int func()
{
   vector<vector<unsigned> > t;
   vector<unsigned> intersectedValues;

   // remove if() branching from loop

   if(t.empty())
       return -1;

   intersectedValues = t[0];

   // no need to construct this every loop
   vector<unsigned> tempIntersectedSubjects;

   // now start from 1
   for(size_t i = 1; i < t.size(); ++i)
   {
       // should already be the correct size from previous loop
       // but just in case this should be cheep
       // (profile removing this line)
       tempIntersectedSubjects.reserve(intersectedValues.size());

       // insert at end() not begin()
       set_intersection(t[i].begin(),
              t[i].end(), intersectedValues.begin(),
              intersectedValues.end(),
              std::inserter(tempIntersectedSubjects, tempIntersectedSubjects.end()));

       // swap should leave tempIntersectedSubjects preallocated to the
       // correct size
       intersectedValues.swap(tempIntersectedSubjects);
       tempIntersectedSubjects.clear(); // will not deallocate

       if(intersectedValues.empty())
           break;
   }
   return 0;
}
like image 22
Galik Avatar answered Oct 02 '22 20:10

Galik