
Scrabble word finder: building a trie, storing a trie, using a trie?

Tags: c#, mysql, trie

What I’m trying to do:

  • Build a mobile web application that helps the user find words to play in Scrabble
  • Users get word suggestions by typing in any number of letters and 0 or more wildcards

How I’m trying to do this:

  • Using MySQL database with a dictionary containing over 400k words
  • Using ASP.NET with C# as server-side programming language
  • Using HTML5, CSS and JavaScript

My current plan:

  • Building a Trie with all the words from the database so I can do a fast and accurate search for words depending on user letter/wildcard input

Having a plan is no good if you can’t execute it, so this is what I need help with:

  • How do I build a Trie from the database? (UPDATE: I want to generate a Trie using the words already in my database, after that's done I'm not going to use the database for word matching any more)
  • How do I store the Trie for fast and easy access? (UPDATE: So I can trash my database)
  • How do I use C# to search for words using the Trie depending on letters and wildcards?

Finally:
Any help is very much appreciated. I’m still a beginner with C# and MySQL, so please be gentle.

Thank you a lot!

Linus Jäderlund asked Sep 16 '11 10:09


1 Answer

First off, let's look at the constraints on the problem. You want to store a word list for a game in a data structure that efficiently supports the "anagram" problem: given a "rack" of n letters, what are all the words of n or fewer letters in the word list that can be made from that rack? The word list will be about 400K words, and so is probably about one to ten megabytes of string data when uncompressed.

A trie is the classic data structure used to solve this problem because it combines memory efficiency with search efficiency. With a word list of about 400K words of reasonable length you should be able to keep the trie in memory (as opposed to going with a b-tree sort of solution where you keep most of the tree on disk because it is too big to fit in memory all at once).

A trie is basically nothing more than a 26-ary tree (assuming you're using the Roman alphabet) where every node has a letter and one additional bit that says whether it is the end of a word.

So let's sketch the data structure:

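using System.Collections.Generic;

class TrieNode
{
    public char Letter;              // the letter this node stands for
    public bool IsEndOfWord;         // true if the path from the root to here spells a word
    public List<TrieNode> Children;  // one child per possible next letter

    public TrieNode(char letter, bool isEndOfWord, List<TrieNode> children)
    { Letter = letter; IsEndOfWord = isEndOfWord; Children = children; }
}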

This of course is just a sketch; you'd probably want to make these have proper property accessors and constructors and whatnot. Also, maybe a flat list is not the best data structure; maybe some sort of dictionary is better. My advice is to get it working first, and then measure its performance, and if it is unacceptable, then experiment with making changes to improve its performance.
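
If measurement does show that scanning a flat list of children is too slow, one possible dictionary-based variant looks like the sketch below. The names DictionaryTrieNode and GetOrAddChild are made up for illustration; the letter is no longer stored on the node because it becomes the dictionary key.

```csharp
using System.Collections.Generic;

// Hypothetical alternative to the list-based node: children are keyed by letter,
// so finding the child for a given letter does not require scanning a list.
class DictionaryTrieNode
{
    public bool IsEndOfWord;
    public Dictionary<char, DictionaryTrieNode> Children =
        new Dictionary<char, DictionaryTrieNode>();

    // Returns the existing child for this letter, or creates and stores a new one.
    public DictionaryTrieNode GetOrAddChild(char letter)
    {
        DictionaryTrieNode child;
        if (!Children.TryGetValue(letter, out child))
        {
            child = new DictionaryTrieNode();
            Children.Add(letter, child);
        }
        return child;
    }
}
```

Whether this is actually faster or smaller than the list for a 26-letter alphabet is something to measure rather than assume. The rest of this answer sticks with the list-based node.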

You can start with an empty trie:

TrieNode root = new TrieNode('^', false, new List<TrieNode>());

That is, this is the "root" trie node that represents the beginning of a word.

How do you add the word "AA", the first word in the Scrabble dictionary? Well, first make a node for the first letter:

root.Children.Add(new TrieNode('A', false, new List<TrieNode>()));

OK, our trie is now

^
|
A

Now add a node for the second letter:

root.Children[0].Children.Add(new TrieNode('A', true, new List<TrieNode>()));

Our trie is now

^
|
A
|
A$   -- we notate the end of word flag with $

Great. Now suppose we want to add AB. We already have a node for "A", so add to it the "B$" node:

root.Children[0].Children.Add(new TrieNode('B', true, new List<TrieNode>()));

and now we have

    ^
    |
    A
   / \
  A$   B$

Keep on going like that. Of course, rather than writing "root.Children[0]..." you'll write a loop that searches the trie to see if the node you want exists, and if not, create it.
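
A minimal sketch of such a loop, assuming the list-based node from above with public members and a three-argument constructor. TrieBuilder and AddWord are hypothetical names chosen for this example.

```csharp
using System.Collections.Generic;

class TrieNode
{
    public char Letter;
    public bool IsEndOfWord;
    public List<TrieNode> Children;

    public TrieNode(char letter, bool isEndOfWord, List<TrieNode> children)
    {
        Letter = letter;
        IsEndOfWord = isEndOfWord;
        Children = children;
    }
}

static class TrieBuilder
{
    // Hypothetical helper: walks the trie one letter at a time,
    // creating any node that does not exist yet.
    public static void AddWord(TrieNode root, string word)
    {
        if (string.IsNullOrEmpty(word)) return; // nothing to add

        TrieNode current = root;
        foreach (char letter in word)
        {
            // Linear scan of the child list for a node with this letter.
            TrieNode next = current.Children.Find(child => child.Letter == letter);
            if (next == null)
            {
                next = new TrieNode(letter, false, new List<TrieNode>());
                current.Children.Add(next);
            }
            current = next;
        }

        // The node reached by the last letter marks the end of a word.
        current.IsEndOfWord = true;
    }
}
```

Calling AddWord(root, "AA") and then AddWord(root, "AB") on an empty root should produce the same three-node trie as the diagram above.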

To store your trie on disk -- frankly, I would just store the word list as a plain text file and rebuild the trie when you need to. It shouldn't take more than 30 seconds or so, and then you can re-use the trie in memory. If you do want to store the trie in some format that is more like a trie, it shouldn't be hard to come up with a serialization format.
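
As a rough illustration of the plain-text approach, here is a sketch that rebuilds the trie from a reader yielding one word per line. The one-word-per-line format, the upper-casing, and the names TrieLoader and Load are all assumptions made for this example.

```csharp
using System.Collections.Generic;
using System.IO;

class TrieNode
{
    public char Letter;
    public bool IsEndOfWord;
    public List<TrieNode> Children = new List<TrieNode>();

    public TrieNode(char letter) { Letter = letter; }
}

static class TrieLoader
{
    // Hypothetical loader: assumes one word per line and skips blank lines.
    public static TrieNode Load(TextReader reader)
    {
        var root = new TrieNode('^');
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string word = line.Trim().ToUpperInvariant();
            if (word.Length == 0) continue;

            // Walk down the trie, creating missing nodes along the way.
            TrieNode current = root;
            foreach (char letter in word)
            {
                TrieNode next = current.Children.Find(c => c.Letter == letter);
                if (next == null)
                {
                    next = new TrieNode(letter);
                    current.Children.Add(next);
                }
                current = next;
            }
            current.IsEndOfWord = true;
        }
        return root;
    }
}
```

You could call it with something like new StreamReader("words.txt"), where words.txt is a placeholder for a file exported from your database. In a web application you would probably build the trie once at startup and keep it in a static field, rather than rebuilding it for every request.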

To search the trie for words matching a rack, the idea is to explore every part of the trie, but to prune out the areas where the rack cannot possibly match. If you haven't got any "A"s on the rack, there is no need to go down any "A" node. I sketched out the search algorithm in your previous question.
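
Here is one way the pruning idea could look in code. It is a sketch under assumptions, not necessarily the algorithm from the earlier answer: the rack is a string of letters A-Z with '?' standing for a wildcard, and RackSearch and FindWords are hypothetical names.

```csharp
using System.Collections.Generic;

class TrieNode
{
    public char Letter;
    public bool IsEndOfWord;
    public List<TrieNode> Children = new List<TrieNode>();

    public TrieNode(char letter) { Letter = letter; }
}

static class RackSearch
{
    // Hypothetical entry point: returns every word in the trie that can be
    // spelled with the tiles on the rack ('?' is assumed to mean a wildcard).
    public static List<string> FindWords(TrieNode root, string rack)
    {
        var counts = new int[26];
        int wildcards = 0;
        foreach (char tile in rack.ToUpperInvariant())
        {
            if (tile == '?') wildcards++;
            else if (tile >= 'A' && tile <= 'Z') counts[tile - 'A']++;
        }

        var results = new List<string>();
        Search(root, counts, wildcards, "", results);
        return results;
    }

    private static void Search(TrieNode node, int[] counts, int wildcards,
                               string prefix, List<string> results)
    {
        if (node.IsEndOfWord) results.Add(prefix);

        foreach (TrieNode child in node.Children)
        {
            int index = child.Letter - 'A';
            if (index < 0 || index >= 26) continue; // ignore anything outside A-Z

            if (counts[index] > 0)
            {
                // Spend a real tile, explore the subtree, then put the tile back.
                counts[index]--;
                Search(child, counts, wildcards, prefix + child.Letter, results);
                counts[index]++;
            }
            else if (wildcards > 0)
            {
                // No real tile for this letter: spend a wildcard instead.
                Search(child, counts, wildcards - 1, prefix + child.Letter, results);
            }
            // Otherwise the rack cannot supply this letter, so the subtree is pruned.
        }
    }
}
```

Spending a real tile whenever one is available and falling back to a wildcard only otherwise means each trie path is visited at most once, so no word is reported twice. The sketch does not record which letters were played as wildcards; you would need to track that separately if it matters for scoring.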

I've got an implementation of a functional-style persistent trie that I've been meaning to blog about for a while but never got around to it. If I do eventually post that I'll update this question.

Eric Lippert answered Nov 03 '22 10:11