
Efficient algorithm for finding all maximal subsets

Tags:

algorithm

set

I have a collection of unique sets (represented as bit masks) and would like to eliminate all elements that are proper subsets of another element. For example:

    input = [{1, 2, 3}, {1, 2}, {2, 3}, {2, 4}, {}]
    output = [{1, 2, 3}, {2, 4}]

I have not been able to find a standard algorithm for this, or even a name for this problem, so I am calling it "maximal subsets" for lack of anything else. Here is an O(n^2) algorithm (in Python for concreteness), assuming is_subset_func is O(1):¹

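    def eliminate_subsets(a, cardinality_func, is_subset_func):
        out = []
        # Visit the largest sets first, so any superset of an element is already in `out`.
        for element in sorted(a, reverse=True, key=cardinality_func):
            for existing in out:
                if is_subset_func(element, existing):
                    break
            else:
                # No existing element contains this one, so it is maximal.
                out.append(element)
        return out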

Is there a more efficient algorithm, hopefully O(n log n) or better?

¹ For bit masks of constant size, as is true in my case, is_subset_func is just element & existing == element, which runs in constant time.
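For concreteness, here is one way the function above might be called with integer bit masks. The helpers `to_mask`, `popcount` and `is_subset` are names made up for this sketch and are not part of the question; the function itself is repeated so that the snippet runs on its own.

```python
def eliminate_subsets(a, cardinality_func, is_subset_func):
    out = []
    for element in sorted(a, reverse=True, key=cardinality_func):
        for existing in out:
            if is_subset_func(element, existing):
                break
        else:
            out.append(element)
    return out


def to_mask(s):
    """Hypothetical helper: encode a set of small non-negative ints as a bit mask."""
    mask = 0
    for e in s:
        mask |= 1 << e
    return mask


def popcount(mask):
    """Hypothetical helper: number of set bits, i.e. the cardinality of the set."""
    return bin(mask).count("1")


def is_subset(element, existing):
    """The footnote's test: every bit of `element` is also set in `existing`."""
    return element & existing == element


masks = [to_mask(s) for s in [{1, 2, 3}, {1, 2}, {2, 3}, {2, 4}, set()]]
result = eliminate_subsets(masks, popcount, is_subset)
print([bin(m) for m in result])  # masks for {1, 2, 3} and {2, 4}
```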

300

asked Dec 31 '12 21:12

Mark Lodato

2 Answers

Suppose you label all the input sets.

A={1, 2, 3}, B={1, 2}, C={2, 3}, D={2, 4}, E={}

Now build intermediate sets, one per element in the universe, containing the labels of the sets where it appears:

1={A,B} 2={A,B,C,D} 3={A,C} 4={D}

Now for each input set compute the intersection of all the label sets of its elements:

For A, {A,B} intersect {A,B,C,D} intersect {A,C} = {A}   (*)

If the intersection contains some label other than that of the set itself, then the set is a subset of the set with that label. Here there is no other label, so the answer is no. But,

For C, {A,B,C,D} intersect {A,C} = {A,C}, which means that it's a subset of A.
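The labelling scheme above can be sketched in a few lines of Python. This is only an illustration using ordinary `set` objects rather than bitmaps. The function name `maximal_labels` and the dict-of-labels input format are assumptions made for the example. The empty set has no elements to intersect over, so it is treated here as a subset of every input set.

```python
def maximal_labels(named_sets):
    """named_sets maps a label to a set; returns the labels of the maximal sets.

    Assumes no two labels map to equal sets.
    """
    # Intermediate sets: element -> labels of the input sets containing it.
    index = {}
    for label, s in named_sets.items():
        for e in s:
            index.setdefault(e, set()).add(label)

    all_labels = set(named_sets)
    maximal = []
    for label, s in named_sets.items():
        if s:
            # Labels of every input set that contains all elements of s.
            containing = set.intersection(*(index[e] for e in s))
        else:
            # The empty set is contained in every input set.
            containing = all_labels
        if containing == {label}:
            maximal.append(label)
    return maximal


example = {
    "A": {1, 2, 3},
    "B": {1, 2},
    "C": {2, 3},
    "D": {2, 4},
    "E": set(),
}
print(maximal_labels(example))  # ['A', 'D']
```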

The cost of this method depends on the implementation of sets. Suppose bitmaps (as you hinted). Say there are n input sets of maximum size m and |U| items in the universe. Then the intermediate set construction produces |U| sets of size n bits, so there is O(|U|n) time to initialize them. Setting the bits requires O(nm) time. Computing each intersection as at (*) above requires O(mn); O(mn^2) for all.

Putting all these together we have O(|U|n) + O(nm) + O(mn^2) = O(|U|n + mn^2). Using the same conventions, your "all pairs" algorithm is O(|U|^2 n^2). Since m <= |U|, this algorithm is asymptotically faster. It's likely to be faster in practice as well, because there's no elaborate bookkeeping to add constant factors.

Addition: Online Version

The OP asked if there is an online version of this algorithm, i.e. one where the set of maximal sets can be maintained incrementally as input sets arrive one-by-one. The answer seems to be yes. The intermediate sets tell us quickly if a new set is a subset of one already seen. But how to tell quickly if it's a superset? And, if so, of which existing maximal sets? For in this case those maximal sets are no longer maximal and must be replaced by the new one.

The key is to note that A is a superset of B iff A' is a subset of B' (the tick' denoting set complement).
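As a quick sanity check of that identity (a sketch, not a proof), one can enumerate every pair of subsets of a small universe. The universe `{1, 2, 3, 4}` and the function name `identity_holds` are arbitrary choices for this example.

```python
from itertools import combinations


def identity_holds(universe):
    """Check A >= B  <=>  A' <= B' for all subsets A, B of `universe`."""
    universe = frozenset(universe)
    subsets = [
        frozenset(c)
        for r in range(len(universe) + 1)
        for c in combinations(sorted(universe), r)
    ]
    return all(
        (a >= b) == ((universe - a) <= (universe - b))
        for a in subsets
        for b in subsets
    )


print(identity_holds({1, 2, 3, 4}))  # True
```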

Following this inspiration, we maintain the intermediate set as before. When a new input set S arrives, do the same test as described above: Let I(e) be the intermediate set for input element e. Then this test is

For X = \intersect_{e \in S} . I(e), |X| > 0

(In this case it's greater than zero rather than one as above because S is not yet in I.) If the test succeeds, then the new set is a (possibly improper) subset of an existing maximal set, so it can be discarded.

Otherwise we must add S as a new maximal set, but before doing this, compute:

Y = \intersect_{e \in S'} . I'(e) = ( \union_{e \in S'} . I(e) )'

where again the tick' is set complement. The union form may be a bit faster to compute. Y contains the maximal sets that have been superseded by S. They must be removed from the maximal collection and from I. Finally add S as a maximal set and update I with S's elements.

Let's work through our example. When A arrives, we add it to I and have

1={A}  2={A}  3={A}

When B arrives, we find X = {A} intersect {A} = {A}, so throw B away and continue. The same happens for C. When D arrives we find X = {A} intersect {} = {}, so continue with Y = I'(1) intersect I'(3) = {} intersect {} = {}. This correctly tells us that maximal set A is not contained in D, so there is nothing to delete. But D must be added as a new maximal set, and I becomes

1={A}  2={A,D}  3={A}  4={D}

The arrival of E causes no change. Posit the arrival then of a new set F={2, 3, 4, 5}. We find

X = {A,D} intersect {A} intersect {D} intersect {} = {}

so we cannot throw F away. Continue with

Y = \intersect_{e in {1}} I'(e) = I'(1) = {D}

This tells us D is a subset of F, so should be discarded while F is added, leaving

1={A} 2={A,F} 3={A,F} 4={F} 5={F}

The computation of the complements is both tricky and nice due to the algorithm's online nature. The universe for input complements need only include input elements seen so far. The universe for intermediate sets consists only of tags of sets in the current maximal collection. For many input streams the size of this set will stabilize or decrease over time.
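The online procedure might look roughly like the following in Python. This is a sketch under stated assumptions, not a tuned implementation. It uses plain `set` objects instead of bitmaps. The class name `OnlineMaximalSets` and its attributes are invented for the example. It computes Y via the union form, taking complements relative to the labels currently in the maximal collection and the elements seen so far.

```python
class OnlineMaximalSets:
    """Maintains the maximal sets of a stream of distinct, labelled sets."""

    def __init__(self):
        self.maximal = {}  # label -> frozenset, the current maximal sets
        self.index = {}    # I: element -> labels of maximal sets containing it

    def add(self, label, s):
        """Process set `s`; return True if it was kept as a new maximal set."""
        s = frozenset(s)

        # X = intersection of I(e) over e in S; non-empty means S is a subset.
        if s:
            x = set.intersection(*(self.index.get(e, set()) for e in s))
        else:
            x = set(self.maximal)  # the empty set is inside any existing set
        if x:
            return False

        # Y = (union of I(e) over seen elements e not in S)', i.e. the maximal
        # sets that have no element outside S and are therefore subsets of S.
        outside = set()
        for e, labels in self.index.items():
            if e not in s:
                outside |= labels
        superseded = set(self.maximal) - outside

        # Remove superseded sets from the collection and from I.
        for old in superseded:
            for e in self.maximal.pop(old):
                self.index[e].discard(old)

        # Finally record S itself.
        self.maximal[label] = s
        for e in s:
            self.index.setdefault(e, set()).add(label)
        return True


tracker = OnlineMaximalSets()
stream = [
    ("A", {1, 2, 3}), ("B", {1, 2}), ("C", {2, 3}),
    ("D", {2, 4}), ("E", set()), ("F", {2, 3, 4, 5}),
]
for label, s in stream:
    tracker.add(label, s)
print(sorted(tracker.maximal))  # ['A', 'F']
```

Running the stream from the worked example leaves `A` and `F` as the maximal sets, with `D` removed when `F` arrives.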

I hope this is helpful.

Summary

The general principle at work here is a powerful idea that crops up often in algorithm design: the reverse map. Whenever you find yourself doing a linear search to find an item with a given attribute, consider building a map from the attribute back to the item. Often it is cheap to construct this map, and it strongly reduces search time. The premier example is a permutation map p[i] that tells you what position the i'th element will occupy after an array is permuted. If you need to find the item that ends up in a given location a, you must search p for a, a linear time operation. On the other hand, an inverse map pi such that pi[p[i]] == i takes no longer to compute than does p (so its cost is "hidden"), but pi[a] produces the desired result in constant time.
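A small illustration of the permutation example, with a made-up permutation `p` and a hypothetical helper `invert_permutation`:

```python
def invert_permutation(p):
    """Given p[i] = final position of item i, return pi with pi[p[i]] == i."""
    pi = [0] * len(p)
    for i, dest in enumerate(p):
        pi[dest] = i
    return pi


p = [2, 0, 3, 1]            # item 0 moves to slot 2, item 1 to slot 0, ...
pi = invert_permutation(p)  # built in one linear pass, like p itself

a = 3
print(p.index(a))  # linear search for the item that lands in slot a
print(pi[a])       # the same answer from a constant-time lookup
```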

Implementation by Original Poster

    import collections.abc
    import operator
    from functools import reduce  # import needed only in Python 3

    def is_power_of_two(n):
        """Returns True iff n is a power of two.  Assumes n > 0."""
        return (n & (n - 1)) == 0

    def eliminate_subsets(sequence_of_sets):
        """Return a list of the elements of `sequence_of_sets`, removing all
        elements that are subsets of other elements.  Assumes that each
        element is a set or frozenset and that no element is repeated."""
        # We need an indexable sequence so that we can use a bitmap to
        # represent each set.  (collections.Sequence was removed in
        # Python 3.10; the ABC lives in collections.abc.)
        if not isinstance(sequence_of_sets, collections.abc.Sequence):
            sequence_of_sets = list(sequence_of_sets)
        # The code below does not handle the case of a sequence containing
        # only the empty set, so let's just handle all easy cases now.
        if len(sequence_of_sets) <= 1:
            return list(sequence_of_sets)
        # For each element, construct a bitmap of the indices of all sets
        # containing that element.
        sets_containing_element = {}
        for i, s in enumerate(sequence_of_sets):
            for element in s:
                try:
                    sets_containing_element[element] |= 1 << i
                except KeyError:
                    sets_containing_element[element] = 1 << i
        # For each set, intersect the bitmaps of its elements.  If more than
        # one bit remains set (the result is not a power of two), some other
        # set contains it and it can be eliminated.  Empty sets are dropped.
        out = [s for s in sequence_of_sets
               if s and is_power_of_two(reduce(
                   operator.and_, (sets_containing_element[x] for x in s)))]
        return out

161

answered Sep 28 '22 00:09

16 revs, 3 users 75%

This problem has been studied in the literature. Given S_1,...,S_k which are subsets of {1,...,n}, Yellin [1] gave an algorithm to find the maximal subset of {S_1,...,S_k} in time O(kdm), where d is the average size of the S_i and m is the cardinality of the maximal subset of {S_1,...,S_k}. This was later improved for some range of parameters by Yellin and Jutla [2] to O((kd)^2/sqrt(log(kd))). It is believed that a truly sub-quadratic algorithm for this problem does not exist.

[1] Daniel M. Yellin: Algorithms for Subset Testing and Finding Maximal Sets. SODA 1992: 386-392.

[2] Daniel M. Yellin, Charanjit S. Jutla: Finding Extremal Sets in Less than Quadratic Time. Inf. Process. Lett. 48(1): 29-34 (1993).

answered Sep 28 '22 01:09

Karthik C. S.
