Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

c++ Algorithm to Compare various length vectors and isolate "unique", sort of

Tags:

c++

I have a complex problem and have been trying to identify what needs to be a very, very efficient algorithm. I'm hoping i can get some ideas from you helpful folks. Here is the situation.

I have a vector of vectors. These nested vectors are of various length, all storing integers in a random order, such as (pseudocode):

vector_list = {
    { 1, 4, 2, 3 },
    { 5, 9, 2, 1, 3, 3 },
    { 2, 4, 2 },
    ...,
    100 more,
    { 8, 2, 2, 4 }
}

and so on, up to over 100 different vectors at a time inside vector_list. Note that the same integer can appear in each vector more than once. I need to remove from this vector_list any vectors that are duplicates of another vector. A vector is a duplicate of another vector if:

  1. It has the same integers as the other vector (regardless of order). So if we have

      vec1 = { 1, 2, 3 }
      vec2 = { 2, 3, 1 }
    

These are duplicates and I need to remove one of them, it doesnt matter which one.

  1. A vector contains all of the other integers of the other vector. So if we have

      vec1 = { 3, 2, 2 }
      vec2 = { 4, 2, 3, 2, 5 }
    

Vec2 has all of the ints of vec1 and is bigger, so i need to delete vec1 in favor of vec2

The problem is as I mentioned the list of vectors can be very big, over 100, and the algorithm may need to run as many as 1000 times on a button click, with a different group of 100+ vectors over 1000 times. Hence the need for efficiency. I have considered the following:

  1. Sorting the vectors may make life easier, but as I said, this has to be efficient, and i'd rather not sort if i didnt have to.

  2. It's more complicated by the fact that the vectors aren't in any order with respect to their size. For example, if the vectors in the list were ordered by size:

    vector_list = {
        {           },
        {           },
        {        },
        {        },
        {      },
        ...
        { },
        { }
    }
    

It might make life easier, but that seems like it would take a lot of effort and I'm not sure about the gain.

The best effort I've had so far to try and solve this problem is:

// list of vectors, just 4 for illustration, but in reality more like 100, with lengths from 5 to 15 integers long
std::vector<std::vector<int>> vector_list;
vector_list.push_back({9});
vector_list.push_back({3, 4, 2, 8, 1});
vector_list.push_back({4, 2});
vector_list.push_back({1, 3, 2, 4});

std::vector<int>::iterator it;
int i;
int j;
int k;

// to test if a smaller vector is a duplicate of a larger vector, i copy the smaller vector, then
// loop through ints in the larger vector, seeing if i can find them in the copy of the smaller. if i can,
// i remove the item from the smaller copy, and if the size of the smaller copy reaches 0, then the smaller vector
// was a duplicate of the larger vector and can be removed.
std::vector<int> copy;

// flag for breaking a for loop below
bool erased_i;

// loop through vector list
for ( i = 0; i < vector_list.size(); i++ ) 
{
    // loop again, so we can compare every vector to every other vector
    for ( j = 0; j < vector_list.size(); j++ ) 
    {
        // don't want to compare a vector to itself
        if ( i != j ) 
        {
            // if the vector in i loop is at least as big as the vector in j loop
            if ( vector_list[i].size() >= vector_list[j].size() ) 
            {
                // copy the smaller j vector
                copy = vector_list[j];
                
                // loop through each item in the larger i vector
                for ( k = 0; k < vector_list[i].size(); k++ ) {
                
                    // if the item in the larger i vector is in the smaller vector, 
                    // remove it from the smaller vector
                    it = std::find(copy.begin(), copy.end(), vector_list[i][k]);
                    
                    if (it != copy.end()) 
                    {
                        // erase
                        copy.erase(it);
                        
                        
                        // if the smaller vector has reached size 0, then it must have been a smaller duplicate that
                        // we can delete
                        if ( copy.size() == 0 ) {
                            vector_list.erase(vector_list.begin() + j);
                            j--;
                        }
                    }
                 
                }
            }
            else
            {
                // otherwise vector j must be bigger than vector i, so we do the same thing
                // in reverse, trying to erase vector i
                copy = vector_list[i];
            
                erased_i = false;
                for ( k = 0; k < vector_list[j].size(); k++ ) {
                    it = std::find(copy.begin(), copy.end(), vector_list[j][k]);
                    if (it != copy.end()) {
                        copy.erase(it);
                        if ( copy.size() == 0 ) {
                            vector_list.erase(vector_list.begin() + i);
                            
                            // put an extra flag so we break out of the j loop as well as the k loop
                            erased_i = true;
                            break;
                        }
                    }

                }
                if ( erased_i ) {
                    // break the j loop because we have to start over with whatever
                    // vector is now in position i
                    break;
                }
            }
        }
    }
}

std::cout << "ENDING VECTORS\n";
// TERMINAL OUTPUT:

vector_list[0]
[9]
vector_list[1]
[3, 4, 2, 8, 1]

So this function gives me the right results, as these are the 2 unique vectors. It also gives me the correct results if i push the initial 4 vectors in reverse order, so the smallest one comes last for example. But it feels so inefficient comparing every vector to every other vector. Plus i have to create these "copies" and try to reduce them to 0 .size() with every comparison I make. very inefficient.

Anyways, any ideas on how I could make this speedier would be much appreciated. Maybe some kind of organization by vector length, I dunno.... It seems wasteful to compare them all to each other.

Thanks!

like image 540
Geoff L Avatar asked Jul 14 '20 06:07

Geoff L


1 Answers

Loop through the vectors and for each vector, map the count of unique values occurring in it. unordered_map<int, int> would suffice for this, let's call it M.

Also maintain a set<unordered_map<int, int>>, say S, ordered by the size of unordered_map<int, int> in decreasing order.

Now we will have to compare contents of M with the contents of unordered_maps in S. Let's call M', the current unordered_map in S being compared with M. M will be a subset of M' only when the count of all the elements in M is less than or equal to the count of their respective elements in M'. If that's the case then it's a duplicate and we'll not insert. For any other case, we'll insert. Also notice that if the size of M is greater than the size of M', M can't be a subset of M'. That means we can insert M in S. This can be used as a pre-condition to speed things up. Maintain the indices of vectors which weren't inserted in S, these are the duplicates and have to be deleted from vector_list in the end.

Time Complexity: O(N*M) + O(N^2*D) + O(N*log(N)) = O(N^2*D) where N is the number of vectors in vector_list, M is the average size of the vectors in vector_list and D is the average size of unordered_map's in S. This is for the worst case when there aren't any duplicates. For average case, when there are duplicates, the second complexity will come down.

Edit: The above procedure will create a problem. To fix that, we'll need to make unordered_maps of all vectors, store them in a vector V, and sort that vector in decreasing order of the size of unordered_map. Then, we'll start from the biggest in this vector and apply the above procedure on it. This is necessary because, a subset, say M1 of a set M2, can be inserted into S before M2 if the respective vector of M1 comes before the respective vector of M2 in vector_list. So now we don't really need S, we can compare them within V itself. Complexity won't change.

Edit 2: The same problem will occur again if sizes of two unordered_maps are the same in V when sorting V. To fix that, we'll need to keep the contents of unordered_maps in some order too. So just replace unordered_map with map and in the comparator function, if the size of two maps is the same, compare element by element and whenever the keys are not the same for the very first time or are same but the M[key] is not the same, put the bigger element before the other in V.

Edit 3: New Time Complexity: O(N*M*log(D)) + O(N*D*log(N)) + O(N^2*D*log(D)) = O(N^2*D*log(D)). Also you might want to pair the maps with the index of the respective vectors in vector_list so as to know which vector you must delete from vector_list when you find a duplicate in V.

IMPORTANT: In sorted V, we must start checking from the end just to be safe (in case we choose to delete a duplicate from vector_list as well as V whenever we encounter it). So for the last map in V compare it with the rest of the maps before it to check if it is a duplicate.

Example:

vector_list = {
  {1, 2, 3},
  {2, 3, 1},
  {3, 2, 2},
  {4, 2, 3, 2, 5},
  {1, 2, 3, 4, 6, 2},
  {2, 3, 4, 5, 6},
  {1, 5}
}

Creating maps of respective vectors:

V = {
  {1->1, 2->1, 3->1},
  {1->1, 2->1, 3->1},
  {2->2, 3->1},
  {2->2, 3->1, 4->1, 5->1},
  {1->1, 2->2, 3->1, 4->1, 6->1},
  {2->1, 3->1, 4->1, 5->1, 6->1},
  {1->1, 5->1}
}

After sorting:

V = {
  {1->1, 2->2, 3->1, 4->1, 6->1},
  {2->1, 3->1, 4->1, 5->1, 6->1},
  {2->2, 3->1, 4->1, 5->1},
  {1->1, 2->1, 3->1},
  {1->1, 2->1, 3->1},
  {1->1, 5->1},
  {2->2, 3->1}
}

After deleting duplicates:

V = {
  {1->1, 2->2, 3->1, 4->1, 6->1},
  {2->1, 3->1, 4->1, 5->1, 6->1},
  {2->2, 3->1, 4->1, 5->1},
  {1->1, 5->1}
}

Edit 4: I tried coding it up. Running it a 1000 times on a list of 100 vectors, the size of each vector being in range [1-250], the range of the elements of vector being [0-50] and assuming the input is available for all the 1000 times, it takes around 2 minutes on my machine. It goes without saying that there is room for improvement in my code (and my machine).

like image 169
srt1104 Avatar answered Nov 12 '22 00:11

srt1104