I have an array of 1000 or so entries, with examples below:
wickedweather liquidweather driveourtrucks gocompact slimprojector
I would like to be able to split these into their respective words, as:
wicked weather liquid weather drive our trucks go compact slim projector
I was hoping a regular expression might do the trick. But since there is no boundary to stop on, nor any sort of capitalization that I could key on, I am thinking that some sort of reference to a dictionary might be necessary?
I suppose it could be done by hand, but why - when it can be done with code! =) But this has stumped me. Any ideas?
By providing an optional parameter, .split('x') can be used to split a string on a specific substring 'x'. Without 'x' specified, .split() simply splits on all whitespace.
To convert a string into a list of words, you just need to split it on whitespace. You can use split() from the string class. The default delimiter for this method is whitespace, i.e., when called on a string, it'll split that string at whitespace characters.
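For reference, here is a minimal sketch of how split() behaves. The sample strings are made up from the question's examples. It shows why split() alone cannot solve the problem: a run-together string has no delimiter to split on.

```python
# Sample data borrowed from the question's expected output (illustrative only).
phrase = "wicked weather liquid weather"

# With no argument, split() breaks on runs of whitespace.
words = phrase.split()

# With an argument, split() breaks on that exact substring.
parts = "go-compact".split("-")

# A run-together string contains no whitespace, so it comes back as one item.
unsplit = "wickedweather".split()

print(words)
print(parts)
print(unsplit)
```

So split() is only useful once something else has decided where the word boundaries are.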
The Viterbi algorithm is much faster. It computes the same scores as the recursive search in Dmitry's answer above, but in O(n) time. (Dmitry's search takes exponential time; Viterbi does it by dynamic programming.)
```python
import re
from collections import Counter

def viterbi_segment(text):
    # probs[i] is the best probability of any segmentation of text[:i];
    # lasts[i] is the start index of the last word in that segmentation.
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    # Walk the back-pointers from the end to recover the words.
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word):
    return dictionary[word] / total

def words(text):
    return re.findall('[a-z]+', text.lower())

# Word counts are taken from the text of big.txt.
dictionary = Counter(words(open('big.txt').read()))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))
```
Testing it:
```python
>>> viterbi_segment('wickedweather')
(['wicked', 'weather'], 5.1518198982768158e-10)
>>> ' '.join(viterbi_segment('itseasyformetosplitlongruntogetherblocks')[0])
'its easy for me to split long run together blocks'
```
To be practical you'll likely want a couple refinements:
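As a rough sketch of what such refinements could look like (these are my assumptions, not necessarily the ones originally intended): multiplying many small probabilities can underflow to 0.0 on long inputs, so one option is to add log-probabilities instead. Another is to give unseen words a small nonzero score rather than zero. The toy `counts` table, the length-based penalty in `word_logprob`, and the name `viterbi_segment_log` are all hypothetical stand-ins for illustration.

```python
import math
from collections import Counter

# Hypothetical toy counts standing in for a real corpus such as big.txt.
counts = Counter({"wicked": 5, "weather": 20, "we": 50, "at": 40,
                  "her": 30, "liquid": 4, "go": 60, "compact": 3})
total = float(sum(counts.values()))
max_word_length = max(map(len, counts))

def word_logprob(word):
    # Known words: log of the relative frequency.
    if word in counts:
        return math.log(counts[word] / total)
    # Assumed smoothing for unseen words: a penalty that grows with length,
    # so a long unknown chunk is never preferred over known words.
    return -math.log(total) - len(word) * math.log(10)

def viterbi_segment_log(text):
    # best[i] is the best log-probability of any segmentation of text[:i];
    # lasts[i] is the start index of the last word in that segmentation.
    best, lasts = [0.0], [0]
    for i in range(1, len(text) + 1):
        score, k = max((best[j] + word_logprob(text[j:i]), j)
                       for j in range(max(0, i - max_word_length), i))
        best.append(score)
        lasts.append(k)
    # Walk the back-pointers from the end to recover the words.
    words = []
    i = len(text)
    while i > 0:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, best[-1]

print(viterbi_segment_log("wickedweather")[0])
print(viterbi_segment_log("gocompact")[0])
```

The structure is the same dynamic program as above; only the scoring changes from a product of probabilities to a sum of logs, which stays finite however long the input is.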