Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data structure for querying whether a given subset exists in a collection of sets

I'm trying to build a data structure for a word game solver.

I need to store about 150,000 sets of the form {A, A, D, E, I, L, P, T, V, Y}. (They are normalized English words, i.e. characters sorted. Note this is a multiset which can contain the same letter twice.)

Need to efficiently get a yes/no answer to the following kind of query: are there any sets with the given subset? For example, do any of the known words contain the set {D, E, I, L, L, P}?

Requirements:

  • Queries must be fast
  • The data structure should fit in a reasonable amount of space (e.g. <50 MB)
  • The data structure need not be built in real time; it's pre-computed.

Is there any data structure out there that would suit this need well? This is a little different from other set matching questions on StackOverflow in that the target sets are actually multisets.

like image 463
PBJ Avatar asked Mar 05 '11 01:03

PBJ


1 Answers

This reminds me of a mutated prefix tree/trie that I made once. Slightly different but it might work. It may not work if you have large/no bounds or if you can't convert it to your language (I code in c++).

So basically, in a trie you usually store children corresponding to the next letter but what I did was I stored children corresponding to the frequency of each letter.

What the question basically is (from my point of view) is, "Are there any sets that have the same or more of a letter than in the subset?" For example, if the subset is { A,D,E,E } then you need to find if there is a set with at least one A, one D and two E's.

So, for the trie you have something like this

            Root
           / | \
          / /|\ \
         / / | \ \
        1 2  ... MAX <-- This represents the frequency of "A"
       /|\ ..... /|\
      1..MAX    1..MAX <-- Frequency of "B"
      ...............
      ...............
      ...............
     1 ... ... ... MAX <-- Frequency of "Y"
    /|\ .... .... / | \
   1..MAX ...... 1 .. MAX <-- Frequency of "Z"

Basically all of the ...'s represent lots of stuff that would take too long to show. The /,| and \ represent parent child relationship and MAX represent the maximum frequency of a letter

So what you do, is you have a struct (I code in c++) of some sort like:

struct NODE {
    NODE *child[MAX + 1]; // Pointers to other NODE's that represents
                          // the frequency of the next letter
};

When you create a node you need to initialize all its children to NULL. You can either do this through a constructor (in c++) or a makeNode() function like

NODE* makeNode() {
    NODE* n = new NODE;         // Create a NODE
    for(int i = 0;i <= MAX;i++) // For each child
        n->child[i] = NULL;     // Initialize to NULL
};

At the start, the trie is just a root

NODE* root = new NODE;

When you add a set to the trie, you get the frequency of each letter and go through the trie. If at a particular node, the child corresponding to the next letter is NULL, you just create a new NODE.

When you search the trie, you search all of the children of each node that corresponds to the frequency of the letter in the subset or larger. For example if the subset has 3 A's you search all of the root->child[3] then root->child[4] then ... then root->child[MAX].

It's probably overly complicated and confusing so 1) If you think I'm not mad then please comment on what's confusing and 2) You may/probably want to just find a simpler method

like image 123
flight Avatar answered Sep 29 '22 12:09

flight