
Fast Data structure for finding strict subsets (from a given list)

I have a large collection of sets, e.g. {{2,4,5}, {4,5}, ...}. Given one of these sets, I would like to iterate through all other sets in the collection which are strict subsets of it. That is, if I am interested in set A, e.g. {2,4,5}, I want to find all sets B where the relative complement B \ A = {}, the empty set, and B ≠ A. Some possibilities are {2,4} and {2,5}, but not {2,3}.

I could of course search linearly and check each time, but I am looking for an efficient data structure, both for the larger collection and for the subsets (if it matters). The number of sets is typically in the tens of thousands, but if it makes a difference I would be interested in cases where it could be in the hundreds of millions. The size of each set is typically in the tens.

I am programming in C++.

Thanks

asked Jun 28 '11 by zenna

3 Answers

Mathematically, you should construct the Hasse diagram for your sets: the partially ordered set whose vertices are your sets and whose arrows are given by containment. Essentially, you want to create a directed acyclic graph with an arrow A --> B if A strictly contains B and there is no C such that A strictly contains C and C strictly contains B.

This is actually going to be a ranked poset, meaning that you can keep track of "levels" of the digraph based on the cardinality of the sets. This is sort of like creating a hash table to jump to the right set.
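
For example, a minimal sketch of this bucketing in Python (assuming sets are encoded as bitmasks, as described further down):

from collections import defaultdict

def bucket_by_rank(masks):
    #group the sets (as bitmasks) by cardinality, so each "level" of the
    #diagram can be reached directly, much like a hash table keyed on set size
    levels = defaultdict(list)
    for m in masks:
        levels[bin(m).count("1")].append(m)
    return levels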

From A, just do a BFS down the graph to find all proper subsets of A.
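
The query step itself is short. A minimal sketch in Python, assuming the diagram's arrows are stored in a dict children that maps each set (as a bitmask) to the sets it covers:

from collections import deque

def strict_subsets(children, a):
    #every set reachable from a by following arrows is a strict subset of a
    seen = set()
    queue = deque([a])
    while queue:
        x = queue.popleft()
        for b in children.get(x, ()):
            if b not in seen:
                seen.add(b)
                queue.append(b)
    return seen   #a itself is never added, so only strict subsets come back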

How to implement this: (in pseudocode)

for (C in sets) {
    for (B in HasseDiagram at rank rank(C)-1) {
      if (C contains B)
        addArrow(C,B)
    }
    for (A in HasseDiagram at rank rank(C)+1) {
      if (A contains C)
        addArrow(A,C)
    }
    addToDiagram(C)
}

To make this and all the subroutines fast, you can encode each set as a binary number (a bitmask) whose bit i is 1 if i is in C and 0 otherwise. This makes testing containment and determining rank trivial.
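
A minimal sketch of this encoding in Python (the function names are illustrative):

def to_mask(s):
    #bit i is set iff element i is in the set
    m = 0
    for i in s:
        m |= 1 << i
    return m

def contains(a, b):
    #True iff the set encoded by a contains the set encoded by b
    return a & b == b

def rank(a):
    #cardinality of the set, i.e. its level in the ranked diagram
    return bin(a).count("1")

#Example: contains(to_mask({2,4,5}), to_mask({4,5})) is True, and rank(to_mask({2,4,5})) is 3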

The above method works if you have all possible subsets. Since you may be missing some, you'll have to check more things. For the pseudocode, you'll need to change rank(C)-1 to the largest integer l < rank(C) such that some element of the HasseDiagram has rank l, and similarly for rank(C)+1. Then, when you're adding the set C to the diagram:

  1. If A covers C, then you only need to check lower ranked sets B that are also covered by A.

  2. If C covers B, then you only need to check higher ranked sets A that also cover B.

By "X covers Y" I mean there is an arrow X -> Y, not just a path.

Furthermore, when you insert C between A and B using one of the above checks, you will need to remove the arrow A --> B when you add A --> C and C --> B.
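
A rough Python sketch of one way to do this bookkeeping when inserting a set C (again using the bitmask encoding; the linear scans are deliberately left unoptimized rather than restricted to neighbouring ranks, and the names are illustrative):

def insert_set(children, c):
    #children maps each stored bitmask to the set of bitmasks it covers
    if c in children:
        return
    supersets = [a for a in children if a != c and a & c == c]  #strict supersets of c
    subsets_  = [b for b in children if b != c and c & b == b]  #strict subsets of c
    #covers of c: minimal strict supersets
    covers  = [a for a in supersets if not any(x != a and a & x == x for x in supersets)]
    #sets covered by c: maximal strict subsets
    covered = [b for b in subsets_ if not any(x != b and x & b == b for x in subsets_)]
    children[c] = set(covered)
    for a in covers:
        #drop any old arrow a --> b that is now implied by a --> c --> b
        children[a].difference_update(covered)
        children[a].add(c)

Starting from an empty dict and inserting every set this way produces the cover arrows that the BFS sketch above follows.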

answered by PengOne


I would suggest storing all of the sets in a tree. Each node of the tree would represent all sets that contain a specified initial list of integers (taken in increasing order). I would have the nodes contain the following pieces of information:

  1. The number of additional elements in the smallest set at this point or below in the tree. (0 means that the set represented by this node is itself in the collection.)
  2. A bitset representing the intersection of all subsets below this one in the tree.
  3. A pointer to an array mapping larger integers to subtrees that contain that as the next element. As a special case, if there is only one subset below this one in the tree, this pointer could be null. (There is no need to fill out unpopulated parts of the tree.)

Given this tree and a query set, you can do a search with recursion and backtracking for all stored sets that are subsets of it. In your search you start with the first element of the query set, look for all stored sets that contain that element, then search for all stored sets that don't contain it.
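
One possible Python sketch of this structure and the backtracking search; it implements fields 1 and 3 of the node (the intersection bitset of field 2 is omitted for brevity), and all names are illustrative:

class Node:
    def __init__(self):
        self.terminal = False          #a stored set ends exactly at this node
        self.min_extra = float("inf")  #fewest additional elements in any set at or below here
        self.children = {}             #next (larger) element -> child node; absent branches are never built

def build(family):
    root = Node()
    for s in family:
        elems = sorted(s)
        node = root
        for depth, e in enumerate(elems):
            node.min_extra = min(node.min_extra, len(elems) - depth)
            node = node.children.setdefault(e, Node())
        node.min_extra = 0
        node.terminal = True
    return root

def query(node, s_sorted, prefix, out):
    #s_sorted holds the still-usable elements of the query set, in increasing order
    if node.min_extra > len(s_sorted):
        return                         #prune: every set below needs more elements than remain
    if node.terminal:
        out.append(list(prefix))       #the prefix spells out a stored subset of the query set
    for i, e in enumerate(s_sorted):
        child = node.children.get(e)
        if child is not None:
            prefix.append(e)
            query(child, s_sorted[i + 1:], prefix, out)
            prefix.pop()               #backtrack

#Example:
#  root = build([{2,4,5}, {4,5}, {2,5}, {2,3}])
#  found = []
#  query(root, sorted({2,4,5}), [], found)
#  found is [[2, 4, 5], [2, 5], [4, 5]]; drop the query set itself if only strict subsets are wanted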

Building this tree takes time and space at most O(n * m * k), where n is the number of sets, m is the average number of elements per set, and k is the size of the universe of elements that can appear in the sets. With random collections of sets that are much smaller than the possible universe of subsets of your k elements, you won't construct most of the tree, and building it will take O(n * m).

In theory traversing this tree could take time O(n). But in practice you'll trim branches of the tree fairly early, and won't traverse most of the other subsets. A back of the envelope calculation suggests that if you have n random sets out of a k element universe with n << 2^k, then a search of the tree is O(n^0.5 * k). (At each integer, half the time it is in the set you're searching for subsets of, in which case you split your search into 2, and half the time it isn't in your set, in which case you eliminate half of your space. After j integers you've got 2^(j/2) searches going over collections of sets of size 2^(-j) * n. Thus by the time you get the searches down to single other subsets to compare with, there are O(n^0.5) searches going. The final comparison of bitmaps is O(k).)

Note: I'm convinced by this back of the envelope calculation that the average performance is o(n^(0.5+epsilon)) for every epsilon > 0, but the convergence is very slow. More precisely, I suspect that the arithmetic average of the performance is n^(0.5 + O(1/sqrt(log(n)))). But that sqrt(log(n)) piece takes a long time to converge.

Note that using the number of additional elements in the smallest set at this point or below in the tree lets your search trivially filter out all sets that are too large to be subsets. Depending on your dataset, this may or may not lead to useful speedups.

answered by btilly


The approach suggested by PengOne would work, but it is not very efficient. To see why it can perform badly, consider the following pathological example:

Suppose you have a universe U with n distinct elements, and let the collection of sets you are searching over consist of all subsets of U with exactly k elements. Then no set here is strictly contained in another, so in the worst case you would have to search over all n choose k candidate sets! In other words, using his proposed data structure is no better than a naive linear search in the worst case.

Clearly you can do much better than this, and the correct data structure to use would be a trie: http://en.wikipedia.org/wiki/Trie

To adapt a trie to work for sets instead of just strings, it is sufficient to fix an ordering on the elements of the universal set, then encode each of your subsets as a binary string of finite length, where the ith character is 0 or 1 depending on whether the set contains the ith element. Here is an implementation in Python:

import math

class SetTree:
    def __init__(self, index, key, left, right):
        self.index = index
        self.key = key
        self.left = left
        self.right = right

#Shared cache: structurally identical subtrees are built only once and reused
#(hash-consing), which is what makes this construction behave like a decision diagram
cached_trees = { }
cached_index = 2

def get_index(T):
    if isinstance(T, SetTree):
        return T.index
    if T:
        return 1
    return 0        

def make_set_tree(key, left, right):
    global cached_trees, cached_index
    code = (key, get_index(left), get_index(right))
    if not code in cached_trees:
        cached_trees[code] = SetTree(cached_index, key, left, right)
        cached_index += 1
    return cached_trees[code]

#Counts element frequencies and returns the universe ordered by descending
#frequency, so the most common elements end up near the root of the tree
def compute_freqs(X):
    freqs, total = {}, 0
    for S in X:
        for a in S:
            if a in freqs:
                freqs[a] += 1
            else:
                freqs[a] = 1
            total += 1
    U = [ (-f, a) for a,f in freqs.items() ]
    U.sort()
    return U

#Constructs the tree recursively
def build_tree_rec(X, U):
    if len(X) == 0:
        return False
    if len(U) == 0:
        return True

    key = U[0][1]

    left_elems = [ S for S in X if key in S]

    if len(left_elems) > 0:
        return make_set_tree(key,
            build_tree_rec(left_elems, U[1:]),
            build_tree_rec([ S for S in X if not key in S ], U[1:]))

    return build_tree_rec(X, U[1:])

#Build a search tree recursively
def build_tree(X):
    U = compute_freqs(X)
    return build_tree_rec(X, U)


#Query a set tree to find all subsets contained in a given set
def query_tree(T, S):
    if not isinstance(T, SetTree):
        return [ [] ] if T else []
    if T.key in S:
        return [ U + [ T.key ] for U in query_tree(T.left, S) ] + query_tree(T.right, S)
    return query_tree(T.right, S)

#Debugging function: Converts a tree to a tuple for printing
def tree_to_tuple(T):
    if isinstance(T, SetTree):
        return (T.key, tree_to_tuple(T.left), tree_to_tuple(T.right))
    return T

Now here is an example usage:

In [15]: search_tree = set_search.build_tree(set_family)

In [16]: set_search.tree_to_tuple(search_tree)
Out[16]: 
(2,
 (4, (5, True, True), (5, True, (3, True, False))),
 (4, (5, True, False), (1, True, False)))

In [17]: set_search.query_tree(search_tree, set([2,3,4,5]))
Out[17]: [[5, 4, 2], [4, 2], [5, 2], [3, 2], [5, 4]]

In [18]: set_search.query_tree(search_tree, set([1,2,3,4,5]))
Out[18]: [[5, 4, 2], [4, 2], [5, 2], [3, 2], [5, 4], [1]]

In [19]: set_search.query_tree(search_tree, set([2,4,5]))
Out[19]: [[5, 4, 2], [4, 2], [5, 2], [5, 4]]

In [20]: set_search.query_tree(search_tree, set([2,5]))
Out[20]: [[5, 2]]

In [21]: set_search.query_tree(search_tree, set([1]))
Out[21]: [[1]]

In [22]: set_search.query_tree(search_tree, set([15]))
Out[22]: []

Note that the amount of work performed by query_tree is proportional to the size of the subtree representing the set of all results it returns. Thus our goal is to estimate the size of such a subtree (on average) and, as a secondary goal, to minimize this quantity. One way to do this is to reorder the elements of the universal set by descending frequency, so that the most common elements are repeated as few times as possible in the lower levels of the tree. This optimization is also done in the above code. A secondary optimization is to cache the trees which have already been searched, to avoid redoing unnecessary work.

EDIT: Just after I got done typing this up, I saw btilly's answer, which comes to more or less the same conclusion about the problem (modulo some technical nitpicks, which I have moved into the comments on his post).

EDIT 2: Realized that this is really just a special case of a binary decision diagram. Don't really have enough energy to fix the write up right now, so will leave it as is. Perhaps fix it tomorrow. http://en.wikipedia.org/wiki/Binary_decision_diagram

answered by Mikola