Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster way of searching array of sets

I have an array containing 100,000 sets. Each set contains natural numbers below 1,000,000. I have to find the number of ordered pairs {m, n}, where 0 < m < 1,000,000, 0 < n < 1,000,000 and m != n, which do not exist together in any of 100,000 sets. A naive method of searching through all the sets leads to 10^5 * (10^6 choose 2) number of searches.

For example I have 2 sets set1 = {1,2,4} set2 = {1,3}. All possible ordered pairs of numbers below 5 are {1,2}, {1,3}, {1,4}, {2,3}, {2,4} and {3,4}. The ordered pairs of numbers below 5 which do not exist together in set 1 are {1,3},{2,3} and {3,4}. The ordered pairs below 5 missing in set 2 are {1,2},{1,4},{2,3},{2,4} and {3,4}. The ordered pairs which do not exist together in both the sets are {2,3} and {3,4}. So the count of number of ordered pairs missing is 2.

Can anybody point me to a clever way of organizing my data structure so that finding the number of missing pairs is faster? I apologize in advance if this question has been asked before.

Update: Here is some information about the structure of my data set. The number of elements in each set varies from 2 to 500,000. The median number of elements is around 10,000. The distribution peaks around 10,000 and tapers down in both direction. The union of the elements in the 100,000 sets is close to 1,000,000.

like image 794
Sandeep Avatar asked Aug 13 '16 16:08

Sandeep


3 Answers

If you are looking for combinations across sets, there is a way to meaningfully condense your dataset, as shown in frenzykryger's answer. However, from your examples, what you're looking for is the number of combinations available within each set, meaning each set contains irreducible information. Additionally, you can't use combinatorics to simply obtain the number of combinations from each set either; you ultimately want to deduplicate combinations across all sets, so the actual combinations matter.

Knowing all this, it is difficult to think of any major breakthroughs you could make. Lets say you have i sets and a maximum of k items in each set. The naive approach would be:

  • If your sets are typically dense (i.e. contain most of the numbers between 1 and 1,000,000), replace them with the complement of the set instead
  • Create a set of 2 tuples (use a set structure that ensures insertion is idempotent)
  • For each set O(i):
    • Evaluate all combinations and insert into set of combinations: O(k choose 2)

The worst case complexity for this isn't great, but assuming you have scenarios where a set either contains most of the numbers between 0 and 1,000,000, or almost none of them, you should see a big improvement in performance.

Another approach would be to go ahead and use combinatorics to count the number of combinations from each set, then use some efficient approach to find the number of duplicate combinations among sets. I'm not aware of such an approach, but it is possible it exists.

like image 168
Asad Saeeduddin Avatar answered Oct 21 '22 11:10

Asad Saeeduddin


First lets solve more simple task of counting number of elements not present in your sets. This task can be reworded in more simple form - instead of 100,000 sets you can think about 1 set which contains all your numbers. Then number of elements not present in this set is x = 1000000 - len(set). Now you can use this number x to count number of combinations. With repetitions: x * x, without repetitions: x * (x - 1). So bottom line of my answer is to put all your numbers in one big set and use it's length to find number of combinations using combinatorics.

Update

So above we have a way to find number of combinations where each element in combination is not in any of the sets. But question was to find number of combinations where each combination is not present in any of the sets.

Lets try to solve simpler problem first:

  • your sets have all numbers in them, none missing
  • each number is present exactly in one set, no duplicates across sets

How you would construct such combinations over such sets? You would simply pick two elements from different sets and resulting combination would not be in any of the sets. Number of such combinations could be counted using following code (it accepts sizes of the sets):

int count_combinations(vector<int>& buckets) {
  int result = 0;
  for (int i=0; i < buckets.size(); ++i) {
    for (int j=i+1; j < buckets.size(); ++j) {
      result += buckets[i] * buckets[j];
    }
  }
  return result;
}

Now let's imagine that some numbers are missing. Then we can just add additional set with those missing numbers to our sets (as a separate set). But we also need to account that given there were n missing numbers there would be n * (n-1) combinations constructed using only these missing numbers. So following code will produce total number of combinations with account to missing numbers:

int missing_numbers = upper_bound - all_numbers.size() - 1;
int missing_combinations = missing_numbers * (missing_numbers - 1);

return missing_combinations + count_combinations(sets, missing_numbers); 

Now lets imagine we have a duplicate across two sets: {a, b, c}, {a, d}. What types of errors they will introduce? Following pairs: {a, a} - repetition, {a, d} - combination which is present in second set. So how to treat such duplicates? We need to eliminate them completely from all sets. Even single instance of a duplicate will produce combination present in some set. Because we can just pick any element from the set where duplicate was removed and produce such combination (in my example - if we will keep a in first set, then pick d from the second to produce {a, d}, if we will keep a in second set, then pick b or c from the first to produce {a, b} and {a, c}). So duplicates shall be removed.

Update

However we can't simply remove all duplicates, consider this counterexample: {a, b} {a, c} {d}. If we simply remove a we will acquire {b} {c} {d} and lost information about not-existing combination {a, d}. Consider another counterexample: {a, b} {a, b, c} {b, d}. If we simply remove duplicates we will acquire {c} {d} and lost information about {a, d}.

Also we can't simply apply such logic to pairs of sets, a simple counter example for numbers < 3: {1, 2} {1} {2}. Here number of missing combinations is 0, but we will incorrectly count in {1, 2} if we will apply duplicates removal to pair of sets. Bottom line is that I can't come up with good technique which will help to correctly handle duplicate elements across sets.

like image 27
featuredpeow Avatar answered Oct 21 '22 12:10

featuredpeow


What you can do, depending on memory requirements, is take advantage of the ordering of Set, and iterate over the values smartly. Something like the code below (untested). You'll iterate over all of your sets, and then for each of your sets you'll iterate over their values. For each of these values, you'll check all of the values in the set after them. Our complexity is reduced to the number of sets times the square of their sizes. You can use a variety of methods to keep track of your found/unfound count, but using a set should be fine, since insertion is simply O(log(n)) where n is no more than 499999500000. In theory using a map of sets (mapping based on the first value) could be slightly faster, but in either case the cost is minimal.

long long numMissing(const std::array<std::set<int>, 100000>& sets){
    std::set<pair<int, int> > found;
    for (const auto& s : sets){
        for (const auto& m : s){
            const auto &n = m;
            for (n++; n != s.cend(); n++){
                found.emplace(m, n);
            }
        }
    }
    return 499999500000 - found.size();
}
like image 2
Adam Martin Avatar answered Oct 21 '22 12:10

Adam Martin