Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Quickly checking if set is superset of stored sets

The problem

I am given N arrays of C booleans. I want to organize these into a datastructure that allows me to do the following operation as fast as possible: Given a new array, return true if this array is a "superset" of any of the stored arrays. With superset I mean this: A is a superset of B if A[i] is true for every i where B[i] is true. If B[i] is false, then A[i] can be anything.

Or, in terms of sets instead of arrays:

Store N sets (each with C possible elements) into a datastructure so you can quickly look up if a given set is a superset of any of the stored sets.

Building the datastructure can take as long as possible, but the lookup should be as efficient as possible, and the datastructure can't take too much space.

Some context

I think this is an interesting problem on its own, but for the thing I'm really trying to solve, you can assume the following:

  • N = 10000
  • C = 1000
  • The stored arrays are sparse
  • The looked up arrays are random (so not sparse)

What I've come up with so far

  1. For O(NC) lookup: Just iterate all the arrays. This is just too slow though.

  2. For O(C) lookup: I had a long description here, but as Amit pointed out in the comments, it was basically a BDD. While this has great lookup speed, it has an exponential number of nodes. With N and C so large, this takes too much space.

I hope that in between this O(N*C) and O(C) solution, there's maybe a O(log(N)*C) solution that doesn't require an exponential amount of space.

EDIT: A new idea I've come up with

  • For O(sqrt(N)C) lookup: Store the arrays as a prefix trie. When looking up an array A, go to the appropriate subtree if A[i]=0, but visit both subtrees if A[i]=1.

    My intuition tells me that this should make the (average) complexity of the lookup O(sqrt(N)C), if you assume that the stored arrays are random. But: 1. they're not, the arrays are sparse. And 2. it's only intuition, I can't prove it.

I will try out both this new idea and the BDD method, and see which of the 2 work out best.

But in the meantime, doesn't this problem occur more often? Doesn't it have a name? Hasn't there been previous research? It really feels like I'm reinventing the wheel here.

like image 641
Migi Avatar asked Feb 19 '12 20:02

Migi


People also ask

How do I find my superset?

The superset relationship is represented using the symbol “⊃”. For instance, the set A is the superset of set B, and it is symbolically represented by A ⊃ B. Then X is the superset of Y (X⊃Y). In other words, we can say that Y is a subset of X (Y⊂X).

What is the difference between superset and subset in Python?

If set A contains set B, then A is said to be the superset of B. If set A is contained in set B then A is said to be the subset of B. It is denoted by the symbol A⊃B or A⊇B. Ans.

What is strict superset?

A strict superset has at least one element that does not exist in its subset. Example. Set is a strict superset of set .


3 Answers

Just to add some background information to the prefix trie solution, recently I found the following paper:

I.Savnik: Index data structure for fast subset and superset queries. CD-ARES, IFIP LNCS, 2013.

The paper proposes the set-trie data structure (container) which provides support for efficient storage and querying of sets of sets using the trie data structure, supporting operations like finding all the supersets/subsets of a given set from a collection of sets.

For any python users interested in an actual implementation, I came up with a python3 package based partly on the above paper. It contains a trie-based container of sets and also a mapping container where the keys are sets. You can find it on github.

like image 137
mmihaltz Avatar answered Oct 19 '22 22:10

mmihaltz


I think prefix trie is a great start.

Since yours arrays are sparse, I would additionally test them in bulk. If (B1 ∪ B2) ⊂ A, both are included. So the idea is to OR-pack arrays by pairs, and to reiterate until there is only one "root" array (it would take only twice as much space). It allows to answer 'Yes' to your question earlier, which is mainly useful if you don't need to know with array is actually contained.

Independently, you can apply for each array a hash function preserving ordering.

Ie : B ⊂ A ⇒ h(B) ≺ h(A)

ORing bits together is such a function, but you can also count each 1-bit in adequate partitions of the array. Here, you can eliminate candidates faster (answering 'No' for a particular array).

like image 21
YvesgereY Avatar answered Oct 20 '22 00:10

YvesgereY


You can simplify the problem by first reducing your list of sets to "minimal" sets: keep only those sets which are not supersets of any other ones. The problem remains the same because if some input set A is a superset of some set B you removed, then it is also a superset of at least one "minimal" subset C of B which was not removed. The advantage of doing this is that you tend to eliminate large sets, which makes the problem less expensive.

From there I would use some kind of ID3 or C4.5 algorithm.

like image 2
mitchus Avatar answered Oct 19 '22 22:10

mitchus