Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to retrieve a random word of a given length from a Trie

I have a simple Trie that I'm using to store about 80k words of length 2 - 15. It works great for checking to see if a string is a word; However, now I need a way of getting a random word of a given length. In other words, I need "getRandomWord(5)" to return a 5 letter word, with all 5 letter words having an equal chance of being returned.

The only way I can think of is to pick a random number and traverse the tree breadth-first until I've passed that many words of the desired length. Is there a better way to do this?

Possibly unnecessary, but here's the code for my trie.

class TrieNode {
    private TrieNode[] c;
    private Boolean end = false;

    public TrieNode() {
        c = new TrieNode[26]; 
    }

    protected void insert(String word) {
        int n = word.charAt(0) - 'A';
        if (c[n] == null)
            c[n] = new TrieNode();
        if (word.length() > 1) {
            c[n].insert(word.substring(1));
        } else {
            c[n].end = true;
        }
    }

    public Boolean isThisAWord(String word) {
        if (word.length() == 0)
            return false;
        int n = word.charAt(0) - 'A';
        if (c[n] != null && word.length() > 1)
            return c[n].isThisAWord(word.substring(1));
        else if (c[n] != null && c[n].end && word.length() == 1)
            return true;
        else
            return false;
    }
}

Edit: The marked answer worked well; I'll add my implementation here for posterity, in case it helps anyone with a similar problem.

First, I made a helper class to hold metadata about the TrieNodes I'm using in the search:

class TrieBranch {
    TrieNode node;
    int letter;
    int depth;
    public TrieBranch(TrieNode n, int l, int d) {
        letter = l; node = n; depth = d;
    }
}

This is the class that holds the Trie and implements the search for the random word. I'm kind of a beginner so there may be better ways to do this, but I tested this a bit and it seems to work. No error handling, so caveat emptor.

class Dict {

    final static int maxWordLength = 13;    
    final static int lettersInAlphabet = 26;
    TrieNode trie;
    int lengthFrequencyByLetter[][];
    int totalLengthFrequency[];

    public Dict() {
        trie = new TrieNode();
        lengthFrequencyByLetter = new int[lettersInAlphabet][maxWordLength + 1];
        totalLengthFrequency = new int[maxWordLength + 1];
    }

    public String getRandomWord(int length) {
        // Returns a random word of the specified length from the trie
        // First, pick a random number from 0 to [number of words with this length]
        Random r = new Random();
        int wordIndex = r.nextInt(totalLengthFrequency[length]);

        // figure out what the first letter of this word would be
        int firstLetter = -1, totalSoFar = 0;
        while (totalSoFar <= wordIndex) {
            firstLetter++;
            totalSoFar += lengthFrequencyByLetter[firstLetter][length];
        }
        wordIndex -= (totalSoFar - lengthFrequencyByLetter[firstLetter][length]);

        // traverse the (firstLetter)'th node of trie depth-first to find the word (wordIndex)'th word
        int[] result = new int[length + 1];
        Stack<TrieBranch> stack = new Stack<TrieBranch>();
        stack.push(new TrieBranch(trie.getBranch(firstLetter), firstLetter, 1));
        while (!stack.isEmpty()) {
            TrieBranch n = stack.pop();
            result[n.depth] = n.letter;

            // examine the current node
            if (n.depth == length && n.node.isEnd()) {
                wordIndex--;
                if (wordIndex < 0) {
                    // search is over
                    String sResult = "";
                    for (int i = 1; i <= length; i++) {
                        sResult += (char)(result[i] + 'a');
                    }
                    return sResult;
                }
            }

            // handle child nodes unless they're deeper than target length
            if (n.depth < length) {
                for (int i = 25; i >= 0; i--) {
                    if (n.node.getBranch(i) != null)
                        stack.push(new TrieBranch(n.node.getBranch(i), i, n.depth + 1));
                }
            }
        }
        return "failure of some sort";
    }
}

Using a casual dictionary (80k words max length 12) each call to getRandomWord() takes abount .2ms, and using a more thorough dictionary (250K words, max length 24) each call takes about 1ms.

like image 811
DevOfZot Avatar asked Jun 17 '13 16:06

DevOfZot


1 Answers

To make sure you have an even chance of getting each 5-letter word, you need to know how many 5-letter words there are in your tree. So as you construct the tree, you add the length of the word you're adding to two counters: an overall frequency counter, and a by-letter frequency counter:

int lengthFrequencyByLetter[letterIndex][maxWordLength-1]
int totalLengthFrequency[maxWordLength-1]

So if you have 4000 5-letter words, and 213 of them start with "d", then

lengthFrequencyByLetter[3][4] = 213

and

totalLengthFrequency[4] = 4000

after you're done adding everything to your tree. (The letter "a" is 0, "b" is 1, ... "z" is 25.)

From here, you can do a search for the nth word of a given length, where n is a random integer picked from a uniform random distribution, in the range (0, totalLengthFrequency[length-1]).

Let's say you have 4000 5-letter words in your structure. You pick random number 1234. Now you can check

lengthFrequencyByLetter[0][4]
lengthFrequencyByLetter[1][4]
lengthFrequencyByLetter[2][4]
lengthFrequencyByLetter[3][4]

in turn, until you exceed a total of 1234. Then you know quickly what the start letter of the 1234th 5-letter word is, and then search there. You don't have to search every word in the tree from the beginning each time.

like image 190
John Avatar answered Oct 23 '22 13:10

John