Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm and Data Structure for Checking letters in a word with another set of letters

I have a dictionary of 200,000 words and a set of letters. I need an algorithm to check if all the letters of a word are in that set of letters. It's very slow to check the words one by one. Because there is a huge number of words to process, I need a data structure to do this. Any ideas? Thanks!

For example: I have a set of letters {b,g,e,f,t,u,i,t,g,n,c,m,m,w,c,s}, I wanna check if word "big" and "buff". All letters of "big" are a subset of the original set then "big" is what i want while "buff" is not what i want because there is only one "f" in the original set. This is what i wanna do.

like image 905
AaronTry Avatar asked Mar 23 '23 11:03

AaronTry


1 Answers

This is for something like Scrabble or Boggle, right? Well, what you do is pre-generate your dictionary by sorting the letters in each word. So, word becomes dorw. Then you shove all these into a Trie data structure. So, in your Trie, the sequence dorw would point to the value word.

[Note that because we sorted the words, they lose their uniqueness, so one sorted word can point to multiple different words. ie your Trie needs to store a list or array at its data nodes]

You can save this structure out if you need to load it quickly later without all the sorting steps.

What you then do is take your input letters and you sort them too. You then start walking through your Trie recursively. If the current letter matches an existing path in the Trie, you follow it. Because you can have unused letter, you also allow the current letter to be dropped.

And it's that simple. Any time you encounter a node in your Trie that has a value, that's a word that you can make out of the letters you used to get there. You just add these words to a list as you find them, and when the recursion is done you have found every possible word.

If you have repeated letters in your input, you may need extra logic to prevent multiple instances of the same word being given (unless you want that). That logic can be invoked during the step that 'leaves out' a letter (you just skip past all the repeated letters) to the next letter.


[edit] You seem to want to do the opposite. My solution above finds all possible words that can be made from a set of letters. But you want to test a specific word to see if it's allowed, given the set of letters you have.

This is simple.

Store your available letters as a histogram. That is, for each letter, you store the number that you have. Then, you walk through each letter in your test word, building a new histogram as you go. As soon as one of your histogram buckets exceeds the value in your available-letters, the word cannot be made. If you get all the way to the end, you can successfully make the word.

like image 155
paddy Avatar answered Apr 26 '23 01:04

paddy