Say I had this string:
hellohowareyou
Is there a fast way to separate this into individual words, so the end result is hello how are you? I can think of several ways, but they would be EXTREMELY slow (first I need to identify each letter against a dictionary, see which letters compose a words, and there would be probably multiple combinations, then I need to decide the most likely combination etc.)
Here's some code that does a recursive brute-force search. It puts the word list into a set, so the lookups are quite fast: the examples below run in less than 1 second on my old 2 GHz machine with 2GB of RAM. However, splitting longer sequences than the examples I've used will certainly take longer, mostly because there are so many possible combinations. To weed out the meaningless results you will either need to do that manually, or use software that can do natural language processing.
#!/usr/bin/env python3
''' Separate words
Use dictionary lookups to recursively split a string into separate words
See http://stackoverflow.com/q/41241216/4014959
Written by PM 2Ring 2016.12.21
'''
# Sowpods wordlist from http://www.3zsoftware.com/download/
fname = 'scrabble_wordlist_sowpods.txt'
allwords = set('AI')
with open(fname) as f:
for w in f:
allwords.add(w.strip())
def parse(data, result=None):
if result is None:
result = []
if data in allwords:
result.append(data)
yield result[::-1]
else:
for i in range(1, len(data)):
first, last = data[:i], data[i:]
if last in allwords:
yield from parse(first, result + [last])
# Test
data = (
'HELLOHOWAREYOU',
'THISEXAMPLEWORKSWELL',
'ISTHEREAFASTWAY',
'ONE',
'TWOWORDS',
)
for s in data:
print(s)
for u in parse(s):
print(u)
print('')
output
HELLOHOWAREYOU
['HELL', 'OHO', 'WARE', 'YOU']
['HELLO', 'HO', 'WARE', 'YOU']
['HELLO', 'HOW', 'ARE', 'YOU']
['HELL', 'OH', 'OW', 'ARE', 'YOU']
['HELLO', 'HOW', 'A', 'RE', 'YOU']
['HELL', 'OH', 'OW', 'A', 'RE', 'YOU']
THISEXAMPLEWORKSWELL
['THIS', 'EXAMPLE', 'WORK', 'SWELL']
['THIS', 'EX', 'AMPLE', 'WORK', 'SWELL']
['THIS', 'EXAMPLE', 'WORKS', 'WELL']
['THIS', 'EX', 'AMPLE', 'WORKS', 'WELL']
ISTHEREAFASTWAY
['I', 'ST', 'HER', 'EA', 'FAS', 'TWAY']
['IS', 'THERE', 'A', 'FAS', 'TWAY']
['I', 'ST', 'HERE', 'A', 'FAS', 'TWAY']
['IS', 'THE', 'RE', 'A', 'FAS', 'TWAY']
['I', 'ST', 'HE', 'RE', 'A', 'FAS', 'TWAY']
['I', 'ST', 'HER', 'EA', 'FAST', 'WAY']
['IS', 'THERE', 'A', 'FAST', 'WAY']
['I', 'ST', 'HERE', 'A', 'FAST', 'WAY']
['IS', 'THE', 'RE', 'A', 'FAST', 'WAY']
['I', 'ST', 'HE', 'RE', 'A', 'FAST', 'WAY']
['I', 'ST', 'HER', 'EA', 'FA', 'ST', 'WAY']
['IS', 'THERE', 'A', 'FA', 'ST', 'WAY']
['I', 'ST', 'HERE', 'A', 'FA', 'ST', 'WAY']
['IS', 'THE', 'RE', 'A', 'FA', 'ST', 'WAY']
['I', 'ST', 'HE', 'RE', 'A', 'FA', 'ST', 'WAY']
ONE
['ONE']
TWOWORDS
['TWO', 'WORDS']
This code was written for Python 3, but you can make it run on Python 2 by changing
yield from parse(first, result + [last])
to
for seq in parse(first, result + [last]):
yield seq
BTW, we can sort the output lists by length, i.e., the number of words in each list. This tends to put the more sensible results near the top.
for s in data:
print(s)
for u in sorted(parse(s), key=len):
print(u)
print('')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With