I have an application where I have a number of sets. A set might be
{4, 7, 12, 18}
i.e. unique numbers, all less than 50.
I then have several data items:
1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
2 {3, 4, 6, 7, 15, 23, 34, 38}
3 {4, 7, 12, 18}
4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
5 {2, 4, 6, 7, 13, 15}
Data items 1, 3 and 4 match the set because they contain all items in the set.
I need to design a data structure that is super fast at identifying whether a data item includes all the members of a set (so the data item is a superset of the set). My best estimates at the moment suggest that there will be fewer than 50,000 sets.
My current implementation has my sets and data as unsigned 64-bit integers and the sets stored in a list. Then to check a data item I iterate through the list doing a ((set & data) == set) comparison. It works and it's space-efficient, but it's slow (O(n) per data item) and I'd be happy to trade some memory for some performance. Does anyone have any better ideas about how to organize this?
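For reference, here is a minimal sketch of that bitmask scan (C++ here, since the question doesn't name a language; the names are illustrative):

#include <cstdint>
#include <vector>

// Build a 64-bit mask from a list of items: bit k is set iff item k is present.
// This works because all items are unique and less than 50, so they fit in 64 bits.
uint64_t to_mask(const std::vector<int>& items) {
    uint64_t m = 0;
    for (int k : items) m |= uint64_t(1) << k;
    return m;
}

// Linear scan: return the indices of all stored sets contained in `data`.
// This is the O(n) check described above.
std::vector<size_t> matches(const std::vector<uint64_t>& sets, uint64_t data) {
    std::vector<size_t> out;
    for (size_t i = 0; i < sets.size(); ++i)
        if ((sets[i] & data) == sets[i])  // every bit of sets[i] is also in data
            out.push_back(i);
    return out;
}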
Edit:
Thanks very much for all the answers. It looks like I need to provide some more information about the problem. I get the sets first, and I then get the data items one by one. I need to check whether the data item matches one of the sets.
The sets are very likely to be 'clumpy': for example, for a given problem, items 1, 3 and 9 might be contained in 95% of the sets. I can predict this to some degree in advance (but not well).
For those suggesting memoization: this is the data structure for a memoized function. The sets represent general solutions that have already been computed, and the data items are new inputs to the function. By matching a data item to a general solution I can avoid a whole lot of processing.
I see another solution which is dual to yours (i.e., testing a data item against every set): use a binary tree where each node tests whether a specific item is included or not.
For instance, if you had the sets A = { 2, 3 }, B = { 4 } and C = { 1, 3 }, you'd have the following tree:
               _NOT_HAVE______[1]______HAVE_____
               |                               |
       _______[2]_______               _______[2]_______
       |               |               |               |
   ___[3]___       ___[3]___       ___[3]___       ___[3]___
   |       |       |       |       |       |       |       |
  [4]     [4]     [4]     [4]     [4]     [4]     [4]     [4]
  / \     / \     / \     / \     / \     / \     / \     / \
 .   B   .   B   .   B   A   AB  .   B   C   BC  .   B   AC  ABC
(Each leaf lists the sets contained in that combination of items; '.' means none.)
After making the tree, checking a data item takes just 50 comparisons (or however many items you can have in a set).
For instance, for { 1, 4 }, you branch through the tree: right (the data item has 1), left (it doesn't have 2), left, right, and you get [ B ], meaning only set B is included in { 1, 4 }.
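Here is a hypothetical C++ sketch of that build-and-walk scheme (cleanup and error handling omitted). It is only feasible for a tiny universe, since the unreduced tree has 2^m leaves for m items; that redundancy is exactly what the reduced form discussed next removes.

#include <cstdint>
#include <string>
#include <vector>

struct Node {
    Node* no = nullptr;           // branch taken when the item is absent
    Node* yes = nullptr;          // branch taken when the item is present
    std::vector<char> contained;  // at a leaf: names of the sets contained
};

// Build the full decision tree over items 1..m. `chosen` is the bitmask of
// items marked "present" on the path so far. Only feasible for tiny m.
Node* build(const std::vector<uint64_t>& sets, const std::string& names,
            uint64_t chosen, int depth, int m) {
    Node* n = new Node;
    if (depth == m) {  // leaf: record every set fully contained in `chosen`
        for (size_t i = 0; i < sets.size(); ++i)
            if ((sets[i] & chosen) == sets[i]) n->contained.push_back(names[i]);
        return n;
    }
    int item = depth + 1;
    n->no  = build(sets, names, chosen, depth + 1, m);
    n->yes = build(sets, names, chosen | uint64_t(1) << item, depth + 1, m);
    return n;
}

// Query: one branch per item, i.e. m comparisons no matter how many sets.
const std::vector<char>& query(const Node* root, uint64_t data, int m) {
    const Node* n = root;
    for (int item = 1; item <= m; ++item)
        n = ((data >> item) & 1) ? n->yes : n->no;
    return n->contained;
}

With A, B, C encoded as the masks 0b01100, 0b10000 and 0b01010, building with m = 4 and querying the mask of { 1, 4 } (0b10010) returns ['B'], matching the walk described above.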
This is basically called a "Binary Decision Diagram". If you are offended by the redundancy in the nodes (as you should be, because 2^50 is a lot of nodes...) then you should consider the reduced form, which is called a "Reduced, Ordered Binary Decision Diagram" and is a commonly used data-structure. In this version, nodes are merged when they are redundant, and you no longer have a binary tree, but a directed acyclic graph.
The Wikipedia page on ROBDDs can provide you with more information, as well as links to libraries which implement this data structure for various languages.
I can't prove it, but I'm fairly certain that there is no solution that can easily beat the O(n) bound. Your problem is "too general": every set has m = 50 properties (namely, property k is that it contains the number k) and the point is that all these properties are independent of each other. There aren't any clever combinations of properties that can predict the presence of other properties. Sorting doesn't work because the problem is very symmetric: any permutation of your 50 numbers gives the same problem but screws up any kind of ordering. Unless your input has a hidden structure, you're out of luck.
However, there is some room for speed/memory tradeoffs. Namely, you can precompute the answers for small queries. Let Q be a query set, and let supersets(Q) be the collection of sets that contain Q, i.e. the solution to your problem. Then, your problem has the following key property:

Q ⊆ P  =>  supersets(Q) ⊇ supersets(P)

In other words, the results for P = {1,3,4} are a subcollection of the results for Q = {1,3}.
Now, precompute all answers for small queries. For demonstration, let's take all queries of size <= 3. You'll get a table
supersets({1})
supersets({2})
...
supersets({50})
supersets({1,2})
supersets({2,3})
...
supersets({1,2,3})
supersets({1,2,4})
...
supersets({48,49,50})
with O(m^3) entries. To compute, say, supersets({1,2,3,4}), you look up supersets({1,2,3}) and run your linear algorithm on this collection. The point is that on average, supersets({1,2,3}) will not contain the full n = 50,000 sets, but only a fraction n/2^3 = 6250 of them, giving an 8-fold increase in speed.
(This is a generalization of the "reverse index" method that other answers suggested.)
Depending on your data set, memory use will be rather terrible, though. But you might be able to omit some rows or speed up the algorithm by noting that a query like {1,2,3,4} can be calculated from several different precomputed answers, like supersets({1,2,3}) and supersets({1,2,4}), and you'll use the smallest of these.
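A minimal sketch of this idea in C++, restricted to singleton queries (the inverted-index special case; tables for size-2 and size-3 queries extend the same pattern). Picking the shortest list is the "use the smallest of these" trick from the previous paragraph; all names are illustrative:

#include <cstdint>
#include <vector>

// table[k] holds supersets({k}): indices of all stored sets containing item k.
std::vector<std::vector<size_t>> precompute(const std::vector<uint64_t>& sets,
                                            int m = 50) {
    std::vector<std::vector<size_t>> table(m + 1);
    for (size_t i = 0; i < sets.size(); ++i)
        for (int k = 1; k <= m; ++k)
            if ((sets[i] >> k) & 1) table[k].push_back(i);
    return table;
}

// supersets(q): every set containing q also contains q's rarest item, so it
// suffices to run the linear (set & q) == q test over that item's list only.
std::vector<size_t> supersets(const std::vector<uint64_t>& sets,
                              const std::vector<std::vector<size_t>>& table,
                              uint64_t q) {
    const std::vector<size_t>* best = nullptr;
    for (size_t k = 1; k < table.size(); ++k)
        if (((q >> k) & 1) && (!best || table[k].size() < best->size()))
            best = &table[k];
    std::vector<size_t> out;
    if (!best) return out;  // empty query: every stored set matches (omitted)
    for (size_t i : *best)
        if ((sets[i] & q) == q) out.push_back(i);  // set i contains q
    return out;
}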