
How can I split multiple joined words?

Tags: string, nlp

I have an array of 1000 or so entries, with examples below:

wickedweather liquidweather driveourtrucks gocompact slimprojector 

I would like to be able to split these into their respective words, as:

wicked weather liquid weather drive our trucks go compact slim projector 

I was hoping a regular expression might do the trick. But since there is no boundary to stop on, nor any sort of capitalization that I could key on, I am thinking that some sort of reference to a dictionary might be necessary?

I suppose it could be done by hand, but why - when it can be done with code! =) But this has stumped me. Any ideas?

asked Oct 12 '08 by Taptronic



1 Answer

The Viterbi algorithm is much faster. It computes the same scores as the recursive search in Dmitry's answer above, but in O(n) time. (Dmitry's search takes exponential time; Viterbi does it by dynamic programming.)

import re
from collections import Counter

def viterbi_segment(text):
    # probs[i] is the probability of the best segmentation of text[:i];
    # lasts[i] is the start index of that segmentation's final word.
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    # Walk the lasts[] back-pointers to recover the best segmentation.
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary[word] / total
def words(text): return re.findall('[a-z]+', text.lower())

# big.txt is a large plain-text corpus to draw word frequencies from
# (e.g. the one from Peter Norvig's spelling-corrector article).
dictionary = Counter(words(open('big.txt').read()))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))

Testing it:

>>> viterbi_segment('wickedweather')
(['wicked', 'weather'], 5.1518198982768158e-10)
>>> ' '.join(viterbi_segment('itseasyformetosplitlongruntogetherblocks')[0])
'its easy for me to split long run together blocks'

To be practical you'll likely want a couple of refinements:

  • Add logs of probabilities instead of multiplying probabilities; this avoids floating-point underflow.
  • Your inputs will in general use words not in your corpus. These substrings must be assigned a nonzero probability as words, or you end up with no solution or a bad solution. (That's just as true for the exponential search algorithm above.) This probability has to be siphoned off the corpus words' probabilities and distributed plausibly among all the other word candidates: the general topic is known as smoothing in statistical language models. (You can get away with some pretty rough hacks, though.) This is where the O(n) Viterbi algorithm blows away the search algorithm, because considering non-corpus words blows up the branching factor. A sketch of both refinements follows this list.
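For instance, here is a minimal sketch of both refinements combined, reusing the dictionary, max_word_length, and total built above. The names log_word_prob and viterbi_segment_log are just for illustration, and the length-based penalty for unseen substrings is one of those rough hacks rather than a principled smoothing model:

import math

def log_word_prob(word):
    if word in dictionary:
        return math.log(dictionary[word] / total)
    # Rough hack: give unseen strings a tiny probability that shrinks
    # rapidly with length, so corpus words are strongly preferred.
    return math.log(10.0 / (total * 10 ** len(word)))

def viterbi_segment_log(text):
    # Same dynamic program as viterbi_segment, but summing
    # log-probabilities instead of multiplying raw probabilities.
    best, lasts = [0.0], [0]
    for i in range(1, len(text) + 1):
        score_k, k = max((best[j] + log_word_prob(text[j:i]), j)
                         for j in range(max(0, i - max_word_length), i))
        best.append(score_k)
        lasts.append(k)
    segmented = []
    i = len(text)
    while 0 < i:
        segmented.append(text[lasts[i]:i])
        i = lasts[i]
    segmented.reverse()
    return segmented, best[-1]

One caveat of this sketch: candidate words longer than max_word_length still can't be proposed, so you may want to widen that window if your inputs can contain long words absent from the corpus.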
answered Oct 09 '22 by Darius Bacon