Superset Search

I'm looking for an algorithm to solve the following in a reasonable amount of time.

Given a collection of sets, find every set in the collection that is a subset of a given query set.

For example, if you have a set of search terms like ["stack overflow", "foo bar", ...], then given a document D, find all search terms whose words all appear in D.

I have found two solutions that are adequate:

1. Use a list of bit vectors as an index. To query for a given superset, build a bit vector for it, then iterate over the list, computing the bitwise OR of the query vector with each stored vector. If the result equals the query vector, the query set is a superset of the set that the stored vector represents. This takes O(n) time, where n is the number of sets in the index, and bitwise OR is very fast. Insertion is O(1). Caveat: to support every word in the English language, the bit vectors would need to be several million bits long, and the words need a total order with no gaps.
2. Use a prefix tree (trie). Sort each set before inserting it into the trie. To search for a given set, sort it first, then iterate over its elements, activating every matching node that is a child of either the root or a previously activated node. Every path through activated nodes from the root to a leaf (or to any node where an indexed set ends) represents a subset of the search set. The complexity of this algorithm is O(a log a + ab), where a is the size of the search set and b is the number of indexed sets.
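
For illustration, the bit-vector approach might look like the following minimal Python sketch. The class and method names (`BitVectorIndex`, `add`, `subsets_of`) are hypothetical, and it assumes that bit positions are handed out only to words that actually occur in an indexed set, which keeps the vectors far shorter than one bit per English word. Python integers stand in for the bit vectors.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


class BitVectorIndex:
    """Linear-scan subset index; each set is stored as an integer bitmask."""

    def __init__(self) -> None:
        # word -> bit position, assigned densely in order of first appearance
        self.bit_of: Dict[str, int] = {}
        self.entries: List[Tuple[int, FrozenSet[str]]] = []

    def add(self, words: Iterable[str]) -> None:
        items = frozenset(words)
        mask = 0
        for w in items:
            # An unseen word gets the next free bit, so the order has no gaps.
            bit = self.bit_of.setdefault(w, len(self.bit_of))
            mask |= 1 << bit
        self.entries.append((mask, items))

    def subsets_of(self, words: Iterable[str]) -> List[FrozenSet[str]]:
        # Words that were never indexed cannot affect the result; skip them.
        query = 0
        for w in words:
            bit = self.bit_of.get(w)
            if bit is not None:
                query |= 1 << bit
        # stored | query == query holds exactly when stored is a subset of query.
        return [items for mask, items in self.entries if mask | query == query]
```

With the terms "stack overflow" and "foo bar" indexed, querying with the words of a document that mentions "stack", "overflow" and "foo" returns only the first term, because "bar" is missing.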

What's your solution?

958

asked Aug 11 '09 23:08

Apocalisp

1 Answer

The prefix trie sounds like something I'd try if the sets were sparse compared to the total vocabulary. Don't forget that if the suffix set of two different prefixes is the same, you can share the subgraph representing the suffix set (this can be achieved by hash-consing rather than arbitrary DFA minimization), giving a DAG rather than a tree. Try ordering your words least or most frequent first (I'll bet one or the other is better than some random or alphabetic order).
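
As a rough sketch of the trie idea with a frequency-based element order, something like the following Python could be a starting point. The names `SetTrie` and `subsets_of` are hypothetical, the index is built in one batch (an assumption, since the frequency order needs global counts), and the suffix sharing via hash-consing mentioned above is not implemented here.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List


class _Node:
    def __init__(self) -> None:
        self.children: Dict[str, "_Node"] = {}
        self.terminal = False  # True if an indexed set ends at this node


class SetTrie:
    """Trie over sets whose elements are sorted by one fixed global order."""

    def __init__(self, sets: Iterable[Iterable[str]]) -> None:
        stored = [frozenset(s) for s in sets]
        freq = Counter(w for s in stored for w in s)
        # Rarest words first, ties broken alphabetically.
        # Use -freq[w] in the key to try most-frequent-first instead.
        ordered = sorted(freq, key=lambda w: (freq[w], w))
        self.rank = {w: i for i, w in enumerate(ordered)}
        self.root = _Node()
        for s in stored:
            node = self.root
            for w in sorted(s, key=self.rank.__getitem__):
                node = node.children.setdefault(w, _Node())
            node.terminal = True

    def subsets_of(self, words: Iterable[str]) -> List[FrozenSet[str]]:
        # Drop words that are not in any indexed set, then sort by rank.
        known = {w for w in words if w in self.rank}
        query = sorted(known, key=self.rank.__getitem__)
        found: List[FrozenSet[str]] = []

        def walk(node: _Node, start: int, path: List[str]) -> None:
            if node.terminal:
                found.append(frozenset(path))
            # Only later query elements can follow, because both the trie
            # paths and the query are sorted by the same rank.
            for i in range(start, len(query)):
                child = node.children.get(query[i])
                if child is not None:
                    walk(child, i + 1, path + [query[i]])

        walk(self.root, 0, [])
        return found
```

Whether rarest-first or most-frequent-first works better depends on the data, so it is worth measuring both on a realistic sample.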

For a variation on your first strategy, where you represent each set by a very large integer (bit vector), use a sparse ordered set/map of integers (a trie on the sequence of bits which skips runs of consecutive 0s) - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.5452 (implemented in http://www.scala-lang.org/docu/files/api/scala/collection/immutable/IntMap.html).

If your reference set (of sets) is fixed, and the queries are themselves members of it (that is, for many of those sets you want to know which others they contain), I'd compute the immediate containment relation: a directed acyclic graph with a path from a to b iff b is contained in a, and without the redundant arcs a->c where a->b and b->c already exist. The branching factor is often small in practice, although it is not bounded by the set's size in general (a 4-element set can have six incomparable 2-element subsets). The vertices reachable from the given set are exactly those that are subsets of it.
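
A brute-force Python sketch of that containment graph follows. The function names `containment_dag` and `reachable` are hypothetical, and the construction compares every pair of sets, so it is only meant to show the structure, not to scale to a large index.

```python
from typing import Dict, FrozenSet, Iterable, List, Set


def containment_dag(
    sets: Iterable[Iterable[str]],
) -> Dict[FrozenSet[str], List[FrozenSet[str]]]:
    """Map each set to its immediate proper subsets among the given sets."""
    nodes = {frozenset(s) for s in sets}  # duplicates collapse to one vertex
    dag: Dict[FrozenSet[str], List[FrozenSet[str]]] = {}
    for a in nodes:
        below = [b for b in nodes if b < a]  # every proper subset of a
        # Keep the arc a->b only if no other subset c lies strictly
        # between b and a; otherwise a->b is redundant.
        dag[a] = [b for b in below if not any(b < c for c in below)]
    return dag


def reachable(
    dag: Dict[FrozenSet[str], List[FrozenSet[str]]],
    start: FrozenSet[str],
) -> Set[FrozenSet[str]]:
    """Return every vertex reachable from start, i.e. its proper subsets."""
    seen: Set[FrozenSet[str]] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for child in dag[node]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen
```

For the sets {a, b, c}, {a, b}, {a} and {c}, the vertex {a, b, c} gets arcs to {a, b} and {c} only, and {a} is still found by following the arc from {a, b}.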

166

answered Oct 20 '22 17:10

Jonathan Graehl

Tags: language-agnostic, algorithm, indexing, data-structures, set