I have an application where I have a number of sets. A set might be
{4, 7, 12, 18}
i.e. unique numbers, all less than 50.
I then have several data items:
1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
2 {3, 4, 6, 7, 15, 23, 34, 38}
3 {4, 7, 12, 18}
4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
5 {2, 4, 6, 7, 13, 15}
Data items 1, 3 and 4 match the set because they contain all items in the set.
I need to design a data structure that is super fast at identifying whether a data item includes all the members of a set (so the data item is a superset of the set). My best estimates at the moment suggest that there will be fewer than 50,000 sets.
My current implementation has my sets and data as unsigned 64-bit integers and the sets stored in a list. Then to check a data item I iterate through the list doing a ((set & data) == set) comparison. It works and it's space-efficient, but it's slow (O(n) per data item) and I'd be happy to trade some memory for some performance. Does anyone have any better ideas about how to organize this?
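For reference, here is a minimal sketch of that bitmask scan (C++ here, since the question doesn't name a language; the names are illustrative):

#include <cstdint>
#include <vector>

// Build a 64-bit mask from a list of items: bit k is set iff item k is present.
// This works because all items are unique and less than 50, so they fit in 64 bits.
uint64_t to_mask(const std::vector<int>& items) {
    uint64_t m = 0;
    for (int k : items) m |= uint64_t(1) << k;
    return m;
}

// Linear scan: return the indices of all stored sets contained in `data`.
// This is the O(n) check described above.
std::vector<size_t> matches(const std::vector<uint64_t>& sets, uint64_t data) {
    std::vector<size_t> out;
    for (size_t i = 0; i < sets.size(); ++i)
        if ((sets[i] & data) == sets[i])  // every bit of sets[i] is also in data
            out.push_back(i);
    return out;
}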
Edit:
Thanks very much for all the answers. It looks like I need to provide some more information about the problem. I get the sets first, and I then get the data items one by one. I need to check whether the data item matches one of the sets.
The sets are very likely to be 'clumpy': for example, for a given problem, items 1, 3 and 9 might be contained in 95% of the sets. I can predict this to some degree in advance (but not well).
For those suggesting memoization: this is the data structure for a memoized function. The sets represent general solutions that have already been computed, and the data items are new inputs to the function. By matching a data item to a general solution I can avoid a whole lot of processing.
I see another solution which is dual to yours (i.e., testing a data item against every set): use a binary tree where each node tests whether a specific item is included or not.
For instance, if you had the sets A = { 2, 3 }, B = { 4 } and C = { 1, 3 }, you'd have the following tree:
               _NOT_HAVE______[1]______HAVE_____
               |                               |
       _______[2]_______               _______[2]_______
       |               |               |               |
   ___[3]___       ___[3]___       ___[3]___       ___[3]___
   |       |       |       |       |       |       |       |
  [4]     [4]     [4]     [4]     [4]     [4]     [4]     [4]
  / \     / \     / \     / \     / \     / \     / \     / \
 .   B   .   B   .   B   A   AB  .   B   C   BC  .   B   AC  ABC
(Each leaf lists the sets contained in that combination of items; '.' means none.)
After making the tree, checking a data item takes just 50 comparisons (or however many items you can have in a set).
For instance, for { 1, 4 }, you branch through the tree: right (the data item has 1), left (it doesn't have 2), left, right, and you get [ B ], meaning only set B is included in { 1, 4 }.
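Here is a hypothetical C++ sketch of that build-and-walk scheme (cleanup and error handling omitted). It is only feasible for a tiny universe, since the unreduced tree has 2^m leaves for m items; that redundancy is exactly what the reduced form discussed next removes.

#include <cstdint>
#include <string>
#include <vector>

struct Node {
    Node* no = nullptr;           // branch taken when the item is absent
    Node* yes = nullptr;          // branch taken when the item is present
    std::vector<char> contained;  // at a leaf: names of the sets contained
};

// Build the full decision tree over items 1..m. `chosen` is the bitmask of
// items marked "present" on the path so far. Only feasible for tiny m.
Node* build(const std::vector<uint64_t>& sets, const std::string& names,
            uint64_t chosen, int depth, int m) {
    Node* n = new Node;
    if (depth == m) {  // leaf: record every set fully contained in `chosen`
        for (size_t i = 0; i < sets.size(); ++i)
            if ((sets[i] & chosen) == sets[i]) n->contained.push_back(names[i]);
        return n;
    }
    int item = depth + 1;
    n->no  = build(sets, names, chosen, depth + 1, m);
    n->yes = build(sets, names, chosen | uint64_t(1) << item, depth + 1, m);
    return n;
}

// Query: one branch per item, i.e. m comparisons no matter how many sets.
const std::vector<char>& query(const Node* root, uint64_t data, int m) {
    const Node* n = root;
    for (int item = 1; item <= m; ++item)
        n = ((data >> item) & 1) ? n->yes : n->no;
    return n->contained;
}

With A, B, C encoded as the masks 0b01100, 0b10000 and 0b01010, building with m = 4 and querying the mask of { 1, 4 } (0b10010) returns ['B'], matching the walk described above.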
This is basically called a "Binary Decision Diagram". If you are offended by the redundancy in the nodes (as you should be, because 2^50 is a lot of nodes...) then you should consider the reduced form, which is called a "Reduced, Ordered Binary Decision Diagram" and is a commonly used data-structure. In this version, nodes are merged when they are redundant, and you no longer have a binary tree, but a directed acyclic graph.
The Wikipedia page on ROBDDs can provide you with more information, as well as links to libraries which implement this data structure for various languages.
I can't prove it, but I'm fairly certain that there is no solution that can easily beat the O(n) bound. Your problem is "too general": every set has m = 50 properties (namely, property k is that it contains the number k) and the point is that all these properties are independent of each other. There aren't any clever combinations of properties that can predict the presence of other properties. Sorting doesn't work because the problem is very symmetric: any permutation of your 50 numbers gives the same problem but screws up any kind of ordering. Unless your input has a hidden structure, you're out of luck.
However, there is some room for speed/memory tradeoffs. Namely, you can precompute the answers for small queries. Let Q be a query set, and let supersets(Q) be the collection of sets that contain Q, i.e. the solution to your problem. Then, your problem has the following key property:

Q ⊆ P  =>  supersets(Q) ⊇ supersets(P)

In other words, the results for P = {1,3,4} are a subcollection of the results for Q = {1,3}.
Now, precompute all answers for small queries. For demonstration, let's take all queries of size <= 3. You'll get a table
supersets({1})
supersets({2})
...
supersets({50})
supersets({1,2})
supersets({2,3})
...
supersets({1,2,3})
supersets({1,2,4})
...
supersets({48,49,50})
with O(m^3) entries. To compute, say, supersets({1,2,3,4}), you look up supersets({1,2,3}) and run your linear algorithm on this collection. The point is that on average, supersets({1,2,3}) will not contain the full n = 50,000 sets, but only a fraction n/2^3 = 6250 of them, giving an 8-fold increase in speed.
(This is a generalization of the "reverse index" method that other answers suggested.)
Depending on your data set, memory use will be rather terrible, though. But you might be able to omit some rows or speed up the algorithm by noting that a query like {1,2,3,4} can be calculated from several different precomputed answers, like supersets({1,2,3}) and supersets({1,2,4}), and you'll use the smallest of these.
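A minimal sketch of this idea in C++, restricted to singleton queries (the inverted-index special case; tables for size-2 and size-3 queries extend the same pattern). Picking the shortest list is the "use the smallest of these" trick from the previous paragraph; all names are illustrative:

#include <cstdint>
#include <vector>

// table[k] holds supersets({k}): indices of all stored sets containing item k.
std::vector<std::vector<size_t>> precompute(const std::vector<uint64_t>& sets,
                                            int m = 50) {
    std::vector<std::vector<size_t>> table(m + 1);
    for (size_t i = 0; i < sets.size(); ++i)
        for (int k = 1; k <= m; ++k)
            if ((sets[i] >> k) & 1) table[k].push_back(i);
    return table;
}

// supersets(q): every set containing q also contains q's rarest item, so it
// suffices to run the linear (set & q) == q test over that item's list only.
std::vector<size_t> supersets(const std::vector<uint64_t>& sets,
                              const std::vector<std::vector<size_t>>& table,
                              uint64_t q) {
    const std::vector<size_t>* best = nullptr;
    for (size_t k = 1; k < table.size(); ++k)
        if (((q >> k) & 1) && (!best || table[k].size() < best->size()))
            best = &table[k];
    std::vector<size_t> out;
    if (!best) return out;  // empty query: every stored set matches (omitted)
    for (size_t i : *best)
        if ((sets[i] & q) == q) out.push_back(i);  // set i contains q
    return out;
}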