Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Need algorithm for fast storage and retrieval (search) of sets and subsets

I need a way of storing sets of arbitrary size for fast query later on. I'll be needing to query the resulting data structure for subsets or sets that are already stored.

=== Later edit: To clarify, an accepted answer to this question would be a link to a study that proposes a solution to this problem. I'm not expecting for people to develop the algorithm themselves. I've been looking over the tuple clustering algorithm found here, but it's not exactly what I want since from what I understand it 'clusters' the tuples into more simple, discrete/aproximate forms and loses the original tuples.

Now, an even simpler example:

[alpha, beta, gamma, delta] [alpha, epsilon, delta] [gamma, niu, omega] [omega, beta]

Query:

[alpha, delta]

Result:

[alpha, beta, gama, delta] [alpha, epsilon, delta]

So the set elements are just that, unique, unrelated elements. Forget about types and values. The elements can be tested among them for equality and that's it. I'm looking for an established algorithm (which probably has a name and a scientific paper on it) more than just creating one now, on the spot.

== Original examples:

For example, say the database contains these sets

[A1, B1, C1, D1], [A2, B2, C1], [A3, D3], [A1, D3, C1] 

If I use [A1, C1] as a query, these two sets should be returned as a result:

[A1, B1, C1, D1], [A1, D3, C1]

Example 2:

Database:

[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
[number of car seats: 2, Gasoline amount: 2L]

Query:

[Distance to berlin: 240km]

Result

[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]

There can be an unlimited number of 'fields' such as Gasoline amount. A solution would probably involve the database grouping and linking sets having common states (such as Gasoline amount: 240) in such a way that the query is as efficient as possible.

What algorithms are there for such needs?

I am hoping there is already an established solution to this problem instead of just trying to find my own on the spot, which might not be as efficient as one tested and improved upon by other people over time.

Clarifications:

  • If it helps answer the question, I'm intending on using them for storing states: Simple example: [Has milk, Doesn't have eggs, Has Sugar]
  • I'm thinking such a requirement might require graphs or multidimensional arrays, but I'm not sure

Conclusion I've implemented the two algorithms proposed in the answers, that is Set-Trie and Inverted Index and did some rudimentary profiling on them. Illustrated below is the duration of a query for a given set for each algorithm. Both algorithms worked on the same randomly generated data set consisting of sets of integers. The algorithms seem equivalent (or almost) performance wise:

enter image description here

like image 727
Ed Rowlett-Barbu Avatar asked Jun 04 '14 10:06

Ed Rowlett-Barbu


2 Answers

I'm confident that I can now contribute to the solution. One possible quite efficient way is a:

Trie invented by Frankling Mark Liang

Such a special tree is used for example in spell checking or autocompletion and that actually comes close to your desired behavior, especially allowing to search for subsets quite conveniently.

The difference in your case is that you're not interested in the order of your attributes/features. For your case a Set-Trie was invented by Iztok Savnik.

What is a Set-Tree? A tree where each node except the root contains a single attribute value (number) and a marker (bool) if at this node there is a data entry. Each subtree contains only attributes whose values are larger than the attribute value of the parent node. The root of the Set-Tree is empty. The search key is the path from the root to a certain node of the tree. The search result is the set of paths from the root to all nodes containing a marker that you reach when you go down the tree and up the search key simultaneously (see below).

But first a drawing by me:

Simple Set-Trie drawing

The attributes are {1,2,3,4,5} which can be anything really but we just enumerate them and therefore naturally obtain an order. The data is {{1,2,4}, {1,3}, {1,4}, {2,3,5}, {2,4}} which in the picture is the set of paths from the root to any circle. The circles are the markers for the data in the picture.

Please note that the right subtree from root does not contain attribute 1 at all. That's the clue.

Searching including subsets Say you want to search for attributes 4 and 1. First you order them, the search key is {1,4}. Now startin from root you go simultaneously up the search key and down the tree. This means you take the first attribute in the key (1) and go through all child nodes whose attribute is smaller or equal to 1. There is only one, namely 1. Inside you take the next attribute in the key (4) and visit all child nodes whose attribute value is smaller than 4, that are all. You continue until there is nothing left to do and collect all circles (data entries) that have the attribute value exactly 4 (or the last attribute in the key). These are {1,2,4} and {1,4} but not {1,3} (no 4) or {2,4} (no 1).

Insertion Is very easy. Go down the tree and store a data entry at the appropriate position. For example data entry {2.5} would be stored as child of {2}.

Add attributes dynamically Is naturally ready, you could immediately insert {1,4,6}. It would come below {1,4} of course.

I hope you understand what I want to say about Set-Tries. In the paper by Iztok Savnik it's explained in much more detail. They probably are very efficient.

I don't know if you still want to store the data in a database. I think this would complicate things further and I don't know what is the best to do then.

like image 92
Trilarion Avatar answered Oct 18 '22 10:10

Trilarion


How about having an inverse index built of hashes?

Suppose you have your values int A, char B, bool C of different types. With std::hash (or any other hash function) you can create numeric hash values size_t Ah, Bh, Ch.

Then you define a map that maps an index to a vector of pointers to the tuples

std::map<size_t,std::vector<TupleStruct*> > mymap;

or, if you can use global indices, just

std::map<size_t,std::vector<size_t> > mymap;

For retrieval by queries X and Y, you need to

  1. get hash value of the queries Xh and Yh
  2. get the corresponding "sets" out of mymap
  3. intersect the sets mymap[Xh] and mymap[Yh]
like image 38
Pavel Avatar answered Oct 18 '22 08:10

Pavel