
Find the words in a long stream of characters. Auto-tokenize

How would you find the correct words in a long stream of characters?

Input :

"The revised report onthesyntactictheoriesofsequentialcontrolandstate"

Google's Output:

"The revised report on syntactic theories sequential controlandstate"

(which is close enough, considering when that output was produced)

How do you think Google does it? How would you increase the accuracy?

asked Oct 10 '10 by unj2

2 Answers

I would try a recursive algorithm like this:

  • Try inserting a space at each position. If the left part is a word, then recur on the right part.
  • Count the number of valid words / number of total words in all the final outputs. The one with the best ratio is likely your answer.

For example, giving it "thesentenceisgood" would run:

thesentenceisgood
the sentenceisgood
    sent enceisgood
         enceisgood: OUT1: the sent enceisgood, 2/3
    sentence isgood
             is good
                go od: OUT2: the sentence is go od, 4/5
             is good: OUT3: the sentence is good, 4/4
    sentenceisgood: OUT4: the sentenceisgood, 1/2
these ntenceisgood
      ntenceisgood: OUT5: these ntenceisgood, 1/2

So you would pick OUT3 as the answer.
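Here is a minimal sketch of that recursion in Python, assuming a toy WORDS set that covers the example (a real implementation would load a full dictionary):

WORDS = {"the", "these", "sent", "sentence", "is", "go", "good"}  # toy dictionary

def segmentations(text):
    """Return every token list the recursion above can produce for `text`."""
    # Option 1: stop splitting and keep the remainder as a single token
    # (it may or may not be a dictionary word, e.g. "enceisgood").
    results = [[text]]
    # Option 2: split off each dictionary-word prefix and recurse on the rest.
    for i in range(1, len(text)):
        left, right = text[:i], text[i:]
        if left in WORDS:
            results += [[left] + rest for rest in segmentations(right)]
    return results

def best_segmentation(text):
    """Pick the output with the highest valid-word / total-word ratio."""
    def ratio(tokens):
        return sum(t in WORDS for t in tokens) / len(tokens)
    return max(segmentations(text), key=ratio)

print(best_segmentation("thesentenceisgood"))  # ['the', 'sentence', 'is', 'good']

Note that this enumerates every branch, so it is exponential in the worst case; memoizing segmentations on its argument keeps it tractable for longer inputs.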

answered Oct 13 '22 by Claudiu

Try a stochastic regular grammar (equivalent to hidden Markov models) with the following rules:

for every word_i in the dictionary:
    stream -> word_i stream                 with probability p_w
    word_i -> letter_i1 ... letter_in       with probability q_w   (this is the spelling of word_i)
stream -> letter stream                     with probability p     (for any letter)
stream -> epsilon                           with probability 1

The probabilities could be derived from a training set, but see the following discussion. The most likely parse is computed using the Viterbi algorithm, which has quadratic time complexity in the number of hidden states, in this case your vocabulary, so you could run into speed issues with large vocabularies.

But what if you set all the p_w = 1, q_w = 1, and p = .5? This means the probabilities come from an artificial language model in which all words are equally likely and all non-words are equally unlikely. Of course you could segment better without this simplification, but the algorithmic complexity goes down by quite a bit. If you look at the recurrence relation in the Wikipedia entry, you can try to simplify it for this special case. The Viterbi parse probability up to position k simplifies to

VP(k) = max_l( VP(k-l) * (1 if text[k-l:k] is a word else .5^l) )

You can bound l by the maximum length of a word and check whether l letters form a word with a hash lookup. The complexity of this is independent of the vocabulary size: O(<text length> * <max l>). Sorry, this is not a proof, just a sketch, but it should get you going.

Another potential optimization: if you build a trie of the dictionary, you can check whether a substring is a prefix of any correct word. So when you query text[k-l:k] and get a negative answer, you already know the same is true of text[k-l:k+d] for any d. To take advantage of this you would have to rearrange the recursion significantly, so I am not sure it can be fully exploited (it can; see the comments).
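Here is a rough sketch of that simplified recurrence in Python, with a toy WORDS set standing in for the dictionary (the hash-set lookup could be swapped for the trie described above):

WORDS = {"on", "the", "syntactic", "theories", "of",
         "sequential", "control", "and", "state"}   # toy dictionary for the example

def segment(text, words=WORDS):
    """DP over VP(k) = max_l VP(k-l) * (1 if text[k-l:k] in words else .5**l)."""
    max_l = max(map(len, words))      # bound l by the longest dictionary word
    n = len(text)
    vp = [0.0] * (n + 1)              # vp[k] = best parse probability of text[:k]
    back = [0] * (n + 1)              # back[k] = length of the last chunk in that parse
    vp[0] = 1.0
    for k in range(1, n + 1):
        for l in range(1, min(k, max_l) + 1):
            chunk = text[k - l:k]
            score = vp[k - l] * (1.0 if chunk in words else 0.5 ** l)
            if score > vp[k]:
                vp[k], back[k] = score, l
    # Recover the segmentation from the back-pointers.
    tokens, k = [], n
    while k > 0:
        tokens.append(text[k - back[k]:k])
        k -= back[k]
    return list(reversed(tokens))

print(segment("onthesyntactictheoriesofsequentialcontrolandstate"))
# ['on', 'the', 'syntactic', 'theories', 'of', 'sequential', 'control', 'and', 'state']

The runtime is O(len(text) * max_l), independent of the vocabulary size, as claimed.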

answered Oct 13 '22 by piccolbo