Word splitting statistical approach

I want to solve the word-splitting problem (parse words from a long string with no spaces). For example, we want to extract the words from somelongword to get [some, long, word].

We can achieve this with a dynamic-programming approach and a dictionary, but another issue we encounter is parsing ambiguity, e.g. orcore => or core vs. orc ore (we don't take phrase meaning or part of speech into account). So I am thinking about using a statistical or ML approach.
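To make the ambiguity concrete, here is a minimal Python sketch of the dictionary-based dynamic approach (the dictionary below is a toy assumption); it enumerates every valid parse rather than picking one:

```python
# Enumerate ALL dictionary segmentations of a string (toy dictionary).
def segmentations(text, dictionary):
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in dictionary:
            # every segmentation of the remainder extends this prefix
            for rest in segmentations(text[i:], dictionary):
                results.append([prefix] + rest)
    return results

words = {"or", "orc", "ore", "core"}
print(segmentations("orcore", words))
# [['or', 'core'], ['orc', 'ore']] -- two valid parses, hence the ambiguity
```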

I found that Naive Bayes and the Viterbi algorithm with a training set can be used to solve this. Can you point me to some information about applying these algorithms to the word-splitting problem?

UPD: I've implemented this method in Clojure, using some advice from Peter Norvig's code.
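For anyone else trying this, below is a minimal Python sketch in the spirit of Norvig's segmentation code (his original appears in the "Natural Language Corpus Data" chapter of Beautiful Data and uses Google n-gram counts; the counts and smoothing here are made-up toy values). It picks the split that maximizes the sum of log unigram probabilities, memoized so each suffix is solved only once:

```python
import math
from functools import lru_cache

# Toy unigram counts; in practice, load counts from a large corpus.
COUNTS = {"some": 300, "long": 250, "word": 400, "so": 500,
          "me": 200, "or": 900, "orc": 5, "ore": 10, "core": 120}
TOTAL = sum(COUNTS.values())

def logp(word):
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # crude smoothing: penalise unknown words, more heavily the longer they are
    return math.log(10.0 / TOTAL) - len(word) * math.log(10)

@lru_cache(maxsize=None)
def segment(text):
    """Most probable segmentation of `text` under the unigram model."""
    if not text:
        return ()
    splits = ((text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1))
    return max(splits, key=lambda ws: sum(logp(w) for w in ws))

print(segment("somelongword"))  # ('some', 'long', 'word') with these counts
```

The memoization is what keeps this tractable: without it the recursion revisits the same suffixes exponentially often.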

asked Mar 12 '12 by mishadoff

2 Answers

I think the slideshow by Peter Norvig and Sebastian Thrun is a good starting point. It presents real-world work done by Google.

answered Nov 02 '22 by ts.


This problem is entirely analogous to word segmentation in many Asian languages that don't explicitly encode word boundaries (e.g. Chinese, Thai). If you want background on approaches to the problem, I'd recommend looking on Google Scholar for current Chinese word segmentation approaches.

You might start by looking at some older approaches: Sproat, Richard and Thomas Emerson. 2003. The First International Chinese Word Segmentation Bakeoff (http://www.sighan.org/bakeoff2003/paper.pdf).

If you want a ready-made solution, I'd recommend LingPipe's tutorial (http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html). I've used it on unsegmented English text with good results. I trained the underlying character language model on a couple million words of newswire text, but I suspect that for this task you'll get reasonable performance using any corpus of relatively normal English text.

The tutorial uses a spelling-correction system to recommend candidate 'corrections' (where the candidate corrections are identical to the input, but with spaces inserted). The spelling corrector is based on Levenshtein edit distance; it simply disallows substitution and transposition, and restricts allowable insertions to a single space.
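That idea can be illustrated with a toy Python sketch (this is not LingPipe's actual API; the counts and scorer below are made up, and a real system would search the candidate space with dynamic programming rather than enumerate it): candidates are the input with spaces inserted, scored by a language model.

```python
import math
from itertools import combinations

# Made-up unigram model standing in for a real character language model.
COUNTS = {"or": 900, "core": 120, "orc": 5, "ore": 10}
TOTAL = sum(COUNTS.values())

def logp(word):
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # heavy, length-dependent penalty for out-of-vocabulary "words"
    return math.log(0.01 / TOTAL) - len(word) * math.log(10)

def candidates(text):
    """All 2**(len(text)-1) ways of inserting spaces into `text`."""
    gaps = range(1, len(text))
    for k in range(len(text)):
        for cut in combinations(gaps, k):
            yield [text[i:j] for i, j in zip((0,) + cut, cut + (len(text),))]

best = max(candidates("orcore"), key=lambda ws: sum(logp(w) for w in ws))
print(" ".join(best))  # 'or core' under these counts
```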

answered Nov 02 '22 by AaronD