
How can I split multiple joined words?

Tags: string, nlp

I have an array of 1000 or so entries, with examples below:

wickedweather liquidweather driveourtrucks gocompact slimprojector 

I would like to be able to split these into their respective words, as:

wicked weather liquid weather drive our trucks go compact slim projector 

I was hoping a regular expression might do the trick. But since there is no boundary to stop on, nor any sort of capitalization that I could key on, I am thinking that some sort of reference to a dictionary might be necessary?

I suppose it could be done by hand, but why - when it can be done with code! =) But this has stumped me. Any ideas?

asked Oct 12 '08 by Taptronic



1 Answer

The Viterbi algorithm is much faster. It computes the same scores as the recursive search in Dmitry's answer above, but in O(n) time. (Dmitry's search takes exponential time; Viterbi does it by dynamic programming.)

import re
from collections import Counter

def viterbi_segment(text):
    # probs[i] is the probability of the best segmentation of text[:i];
    # lasts[i] is the start index of that segmentation's final word.
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    # Walk the lasts[] back-pointers to recover the best segmentation.
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary[word] / total
def words(text): return re.findall('[a-z]+', text.lower())

# big.txt is a large plain-text corpus to draw word frequencies from
# (e.g. the one from Peter Norvig's spelling-corrector article).
dictionary = Counter(words(open('big.txt').read()))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))

Testing it:

>>> viterbi_segment('wickedweather')
(['wicked', 'weather'], 5.1518198982768158e-10)
>>> ' '.join(viterbi_segment('itseasyformetosplitlongruntogetherblocks')[0])
'its easy for me to split long run together blocks'

To be practical you'll likely want a couple of refinements:

  • Add logs of probabilities instead of multiplying probabilities; this avoids floating-point underflow.
  • Your inputs will in general use words not in your corpus. These substrings must be assigned a nonzero probability as words, or you end up with no solution or a bad solution. (That's just as true for the exponential search algorithm above.) This probability has to be siphoned off the corpus words' probabilities and distributed plausibly among all the other word candidates: the general topic is known as smoothing in statistical language models. (You can get away with some pretty rough hacks, though.) This is where the O(n) Viterbi algorithm blows away the search algorithm, because considering non-corpus words blows up the branching factor. A sketch of both refinements follows this list.
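For instance, here is a minimal sketch of both refinements combined, reusing the dictionary, max_word_length, and total built above. The names log_word_prob and viterbi_segment_log are just for illustration, and the length-based penalty for unseen substrings is one of those rough hacks rather than a principled smoothing model:

import math

def log_word_prob(word):
    if word in dictionary:
        return math.log(dictionary[word] / total)
    # Rough hack: give unseen strings a tiny probability that shrinks
    # rapidly with length, so corpus words are strongly preferred.
    return math.log(10.0 / (total * 10 ** len(word)))

def viterbi_segment_log(text):
    # Same dynamic program as viterbi_segment, but summing
    # log-probabilities instead of multiplying raw probabilities.
    best, lasts = [0.0], [0]
    for i in range(1, len(text) + 1):
        score_k, k = max((best[j] + log_word_prob(text[j:i]), j)
                         for j in range(max(0, i - max_word_length), i))
        best.append(score_k)
        lasts.append(k)
    segmented = []
    i = len(text)
    while 0 < i:
        segmented.append(text[lasts[i]:i])
        i = lasts[i]
    segmented.reverse()
    return segmented, best[-1]

One caveat of this sketch: candidate words longer than max_word_length still can't be proposed, so you may want to widen that window if your inputs can contain long words absent from the corpus.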
answered Oct 09 '22 by Darius Bacon