Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Implementing T9 text prediction

I have a T9 dictionary in memory (trie/hash_map). The dictionary contains word-rating pairs, so when a word is picked from dictionary, its rating increments, and the word-rating pair goes up in the word list.

Let's say that there is a method to pick a word from the dictionary. That method also does some word rating routine.

In input I have string of numbers (1-9, '*' to change word and ' ') which were pressed on telephone.

Questions:

  1. Is there any algorithm to parse the string fast?
  2. Which data structure would be good there?

UPD:

Full problem text (Problem D)

Hash_map implementation

Trie implementation

like image 726
sashab Avatar asked Feb 24 '23 04:02

sashab


1 Answers

One option that I think would be particularly efficient would be to preprocess the trie into a modified structure specifically optimized for predicting words based on keystrokes.

Intuitively, the new structure is a trie created out of the possible digits that could be pressed at any point. Each trie node then stores a priority queue of the words that could potentially be spelled out using those digits. You could then predict which words to use via the following algorithm:

  • Start at the root of the trie.
  • For each digit, follow the pointer corresponding to the digit.
  • If you walk off the trie, then there aren't any suggestions to make.
  • Otherwise, look at the priority queue for the words that can be formed from precisely these digits, then suggest the element in that priority queue with the highest use count.

This algorithm takes time O(n + Tmax), where n is the length of the digit string and Tmax is the time required to find the most popular word for the given prefix. This could be O(1) for something like a binary heap, but could be slower for more complex heaps.

To build this data structure, you would preprocess the original list of words/frequencies as follows. For each word, determine the series of digits corresponding to its letters. For example, the word "monsoon" would be 6667666, while the word "apple" would be 27753. Then, insert that sequence into the digit trie, creating new nodes as necessary. When you arrive at the final node corresponding to this digit string, insert the word into the corresponding priority queue for that node. The total time for this operation is actually quite good; given a list of words with a total of n letters in it, this can be done in O(n Tinsert) time, where Tinsert is the time required to insert a word into the priority queue. This is because we need to visit each letter at most three times - once to convert it to a digit, once to follow the appropriate edge in the digit trie, and once to insert it into a priority queue.

This approach also makes it easy to insert new words into the predictor quite easily. Whenever the user walks off the digit trie or chooses a word that isn't the number one word, you could just insert that word into the digit trie by creating the appropriate series of nodes in the trie, then inserting the word into the correct priority queue.

If you make your priority queues out of a balanced binary search tree that stores a pointer to its smallest element, you can implement finding the fastest word for a series of n digits in O(n) time, and can then list off all of the other words with that prefix in amortized O(1) time apiece. You could also insert or delete new words very easily this way.

Because the words are stored in a trie, you can in O(1) update your guess of the current word by just walking from the current trie node down into the child node given by the current number. When the user hits the space key, you would just reset back up to the root of the digit trie. Finally, when the user hits star, you could use the above trick to just look at the next most-popular word for the given trie node.

Hope this helps!

like image 100
templatetypedef Avatar answered Mar 04 '23 08:03

templatetypedef