Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to perform subset test operation on a large collection of sets with same domain

Assume we have trillions of sets stored somewhere. The domain for each of these sets is the same. It is also finite and discrete. So each set may be stored as a bit field (eg: 0000100111...) of a relatively short length (eg: 1024). That is, bit X in the bitfield indicates whether item X (of 1024 possible items) is included in the given set or not.

Now, I want to devise a storage structure and an algorithm to efficiently answer the query: what sets in the data store have set Y as a subset. Set Y itself is not present in the data store and is specified at run time.

Now the simplest way to solve this would be to AND the bitfield for set Y with bit fields of every set in the data store one by one, picking the ones whose AND result matches Y's bitfield.

How can I speed this up? Is there a tree structure (index) or some smart algorithm that would allow me to perform this query without having to AND every stored set's bitfield?

Are there databases that already support such operations on large collections of sets?

like image 864
niktech Avatar asked Dec 28 '10 00:12

niktech


2 Answers

If you can preprocess the sets, the subset relation is representable as a DAG (because you're describing a poset). If the transitive reduction is computed, then I think you can avoid testing all the sets by just performing a DFS starting from the biggest sets and stopping whenever Y is no longer a subset of the current set being visited.

like image 66
lijie Avatar answered Oct 23 '22 11:10

lijie


Depending on the cardinality of the set from which all the sets are drawn, one option might be to build an inverted index mapping from elements to the sets that contain them. Given a set Y, you could then find all sets that have Y as a subset by finding all of the sets that contain each element individually and computing their intersection. If you store the lists in sorted order (for example, by numbering all the sets in your database with values 0, 1, etc.) then you should be able to compute this intersection fairly efficiently, assuming that no one element is contained in too many sets.

like image 44
templatetypedef Avatar answered Oct 23 '22 10:10

templatetypedef