I'm trying to build a data structure for a word game solver.
I need to store about 150,000 sets of the form {A, A, D, E, I, L, P, T, V, Y}. (They are normalized English words, i.e. characters sorted. Note this is a multiset which can contain the same letter twice.)
Need to efficiently get a yes/no answer to the following kind of query: are there any sets with the given subset? For example, do any of the known words contain the set {D, E, I, L, L, P}?
Requirements:
Is there any data structure out there that would suit this need well? This is a little different from other set matching questions on StackOverflow in that the target sets are actually multisets.
This reminds me of a mutated prefix tree/trie that I made once. Slightly different but it might work. It may not work if you have large/no bounds or if you can't convert it to your language (I code in c++).
So basically, in a trie you usually store children corresponding to the next letter but what I did was I stored children corresponding to the frequency of each letter.
What the question basically is (from my point of view) is, "Are there any sets that have the same or more of a letter than in the subset?" For example, if the subset is { A,D,E,E } then you need to find if there is a set with at least one A, one D and two E's.
So, for the trie you have something like this
Root
/ | \
/ /|\ \
/ / | \ \
1 2 ... MAX <-- This represents the frequency of "A"
/|\ ..... /|\
1..MAX 1..MAX <-- Frequency of "B"
...............
...............
...............
1 ... ... ... MAX <-- Frequency of "Y"
/|\ .... .... / | \
1..MAX ...... 1 .. MAX <-- Frequency of "Z"
Basically all of the ...'s represent lots of stuff that would take too long to show. The /,| and \ represent parent child relationship and MAX represent the maximum frequency of a letter
So what you do, is you have a struct (I code in c++) of some sort like:
struct NODE {
NODE *child[MAX + 1]; // Pointers to other NODE's that represents
// the frequency of the next letter
};
When you create a node you need to initialize all its children to NULL. You can either do this through a constructor (in c++) or a makeNode() function like
NODE* makeNode() {
NODE* n = new NODE; // Create a NODE
for(int i = 0;i <= MAX;i++) // For each child
n->child[i] = NULL; // Initialize to NULL
};
At the start, the trie is just a root
NODE* root = new NODE;
When you add a set to the trie, you get the frequency of each letter and go through the trie. If at a particular node, the child corresponding to the next letter is NULL, you just create a new NODE.
When you search the trie, you search all of the children of each node that corresponds to the frequency of the letter in the subset or larger. For example if the subset has 3 A's you search all of the root->child[3] then root->child[4] then ... then root->child[MAX].
It's probably overly complicated and confusing so 1) If you think I'm not mad then please comment on what's confusing and 2) You may/probably want to just find a simpler method
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With