Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient algorithm for finding all maximal subsets

Tags:

algorithm

set

I have a collection of unique sets (represented as bit masks) and would like to eliminate all elements that are proper subsets of another element. For example:

input = [{1, 2, 3}, {1, 2}, {2, 3}, {2, 4}, {}] output = [{1, 2, 3}, {2, 4}] 

I have not been able to find a standard algorithm for this, or even a name for this problem, so I am calling it "maximal subsets" for lack of anything else. Here is an O(n^2) algorithm (in Python for concreteness), assuming is_subset_func is O(1):1

def eliminate_subsets(a, cardinality_func, is_subset_func):     out = []     for element in sorted(a, reverse=True, key=cardinality_func):         for existing in out:             if is_subset_func(element, existing):                 break         else:             out.append(element)     return out 

Is there a more efficient algorithm, hopefully O(n log n) or better?


1 For bit masks of constant size, as is true in my case, is_subset_func is just element & existing == element, which runs in constant time.

like image 300
Mark Lodato Avatar asked Dec 31 '12 21:12

Mark Lodato


People also ask

How do you find the maximal subset?

First, break each number into its prime factors (or use a list of primes to generate the full list from 1 to n, same thing). Solve the subset problem with recursive descend by finding a subset that contains no common primes. Run through all solutions and find the largest one. Save this answer.

What is maximal subset?

In mathematics, especially in order theory, a maximal element of a subset S of some preordered set is an element of S that is not smaller than any other element in S. A minimal element of a subset S of some preordered set is defined dually as an element of S that is not greater than any other element in S.

How do you determine proper subsets?

A proper subset of a set A is a subset of A that is not equal to A. In other words, if B is a proper subset of A, then all elements of B are in A but A contains at least one element that is not in B. For example, if A={1,3,5} then B={1,5} is a proper subset of A.

What does it mean for a set to be maximal?

A member of a collection of sets is said to be maximal if it cannot be expanded to another member by addition of any element.


2 Answers

Suppose you label all the input sets.

A={1, 2, 3}, B={1, 2}, C={2, 3}, D={2, 4}, E={} 

Now build intermediate sets, one per element in the universe, containing the labels of the sets where it appears:

1={A,B} 2={A,B,C,D} 3={A,C} 4={D} 

Now for each input set compute the intersection of all the label sets of its elements:

For A, {A,B} intesect {A,B,C,D} intersect {A,C} = {A}   (*) 

If the intersection contains some label other than for the set itself, then it's s a subset of that set. Here there is no other element, so the answer is no. But,

For C, {A,B,C,D} intersect {A,C} = {A,C}, which means that it's a subset of A. 

The cost of this method depends on the implementation of sets. Suppose bitmaps (as you hinted). Say there are n input sets of maximum size m and |U| items in the universe. Then the intermediate set construction produces |U| sets of size n bits, so there is O(|U|n) time to initialize them. Setting the bits requires O(nm) time. Computing each intersection as at (*) above requires O(mn); O(mn^2) for all.

Putting all these together we have O(|U|n) + O(nm) +O(mn^2) = O(|U|n + mn^2). Using the same conventions, your "all pairs" algorithm is O(|U|^2 n^2). Since m <= |U|, this algorithm is asymptotically faster. It's likely to be faster in practice as well because there's no elaborate bookkeeping to add constant factors.

Addition: On Line Version

The OP asked if there is an online version of this algorithm, i.e. one where the set of maximal sets can be maintained incrementally as input sets arrive one-by-one. The answer seems to be yes. The intermediate sets tell us quickly if a new set is a subset of one already seen. But how to tell quickly if it's a superset? And, if so, of which existing maximal sets? For in this case those maximal sets are no longer maximal and must be replaced by the new one.

The key is to note that A is a superset of B iff A' is a subset of B' (the tick' denoting set complement).

Following this inspiration, we maintain the intermediate set as before. When a new input set S arrives, do the same test as described above: Let I(e) be the intermediate set for input element e. Then this test is

For X = \intersect_{e \in S} . I(e), |X| > 0 

(In this case it's greater than zero rather than one as above because S is not yet in I.) If the test succeeds, then the new set is a (possibly improper) subset of an existing maximal set, so it can be discarded.

Otherwise we must add S as a new maximal set, but before doing this, compute:

Y = \intersect_{e \in S'} . I'(e) = ( \union_{e \in S'} . I(e) )' 

where again the tick' is set complement. The union form may be a bit faster to compute. Y contains the maximal sets that have been superceded by S. They must be removed from the maximal collection and from I. Finally add S as a maximal set and update I with S's elements.

Let's work through our example. When A arrives, we add it to I and have

1={A}  2={A}  3={A} 

When B arrives, we find X = {A} intersect {A} = {A}, so throw B away and continue. The same happens for C. When D arrives we find X = {A} intersect {} = {}, so continue with Y = I'(1) intersect I'(3) = {} intersect {}. This correctly tells us that maximal set A is not contained in D, so there is nothing to delete. But it must be added as a new maximal set, and I becomes

1={A}  2={A,D}  3={A}  4={D} 

The arrival of E causes no change. Posit the arrival then of a new set F={2, 3, 4, 5}. We find

X = {A} isect {A,D} isect {A} isect {D} isect {} 

so we cannot throw F away. Continue with

Y = \intersect_{e in {1}} I'(e) = I'(1) = {D} 

This tells us D is a subset of F, so should be discarded while F is added, leaving

1={A} 2={A,F} 3={A,F} 4={F} 5={F} 

The computation of the complements is both tricky and nice due to the algorithm's online nature. The universe for input complements need only include input elements seen so far. The universe for intermediate sets consists only of tags of sets in the current maximal collection. For many input streams the size of this set will stabilize or decrease over time.

I hope this is helpful.

Summary

The general principle at work here is a powerful idea that crops of often in algorithm design. It's the reverse map. Whenever you find yourself doing a linear search to find an item with a given attribute, consider building a map from the attribute back to item. Often it is cheap to construct this map, and it strongly reduces search time. The premier example is a permutation map p[i] that tells you what position the i'th element will occupy after an array is permuted. If you need to search out the item that ends up in a given location a, you must search p for a, a linear time operation. On the other hand, an inverse map pi such that pi[p[i]] == i takes no longer to compute than does p (so its cost is "hidden"), but pi[a] produces the desired result in constant time.

Implementation by Original Poster

import collections import operator from functools import reduce # only in Python 3  def is_power_of_two(n):     """Returns True iff n is a power of two.  Assumes n > 0."""     return (n & (n - 1)) == 0  def eliminate_subsets(sequence_of_sets):     """Return a list of the elements of `sequence_of_sets`, removing all     elements that are subsets of other elements.  Assumes that each     element is a set or frozenset and that no element is repeated."""     # The code below does not handle the case of a sequence containing     # only the empty set, so let's just handle all easy cases now.     if len(sequence_of_sets) <= 1:         return list(sequence_of_sets)     # We need an indexable sequence so that we can use a bitmap to     # represent each set.     if not isinstance(sequence_of_sets, collections.Sequence):         sequence_of_sets = list(sequence_of_sets)     # For each element, construct the list of all sets containing that     # element.     sets_containing_element = {}     for i, s in enumerate(sequence_of_sets):         for element in s:             try:                 sets_containing_element[element] |= 1 << i             except KeyError:                 sets_containing_element[element] = 1 << i     # For each set, if the intersection of all of the lists in which it is     # contained has length != 1, this set can be eliminated.     out = [s for s in sequence_of_sets            if s and is_power_of_two(reduce(                operator.and_, (sets_containing_element[x] for x in s)))]     return out 
like image 161
16 revs, 3 users 75% Avatar answered Sep 28 '22 00:09

16 revs, 3 users 75%


This problem has been studied in literature. Given S_1,...,S_k which are subsets of {1,...,n}, Yellin [1] gave an algorithm to find the maximal subset of {S_1,...,S_k} in time O(kdm) where d is the average size of the S_i, and m is the cardinality of the the maximal subset of {S_1,...,S_k}. This was later improved for some range of parameters by Yellin and Jutla [2] to O((kd)^2/sqrt(log(kd))). It is believed that a truly sub-quadratic algorithm to this problem does not exist.

[1] Daniel M. Yellin: Algorithms for Subset Testing and Finding Maximal Sets. SODA 1992: 386-392.

[2] Daniel M. Yellin, Charanjit S. Jutla: Finding Extremal Sets in Less than Quadratic Time. Inf. Process. Lett. 48(1): 29-34 (1993).

like image 22
Karthik C. S. Avatar answered Sep 28 '22 01:09

Karthik C. S.