Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast way to break a joined string of words into individual words

Tags:

python

string

nlp

Say I had this string:

hellohowareyou

Is there a fast way to separate this into individual words, so the end result is hello how are you? I can think of several ways, but they would be EXTREMELY slow (first I need to identify each letter against a dictionary, see which letters compose a words, and there would be probably multiple combinations, then I need to decide the most likely combination etc.)

like image 906
anemaria20 Avatar asked Oct 21 '25 12:10

anemaria20


1 Answers

Here's some code that does a recursive brute-force search. It puts the word list into a set, so the lookups are quite fast: the examples below run in less than 1 second on my old 2 GHz machine with 2GB of RAM. However, splitting longer sequences than the examples I've used will certainly take longer, mostly because there are so many possible combinations. To weed out the meaningless results you will either need to do that manually, or use software that can do natural language processing.

#!/usr/bin/env python3

''' Separate words

    Use dictionary lookups to recursively split a string into separate words

    See http://stackoverflow.com/q/41241216/4014959

    Written by PM 2Ring 2016.12.21
'''

# Sowpods wordlist from http://www.3zsoftware.com/download/

fname = 'scrabble_wordlist_sowpods.txt'
allwords = set('AI')
with open(fname) as f:
    for w in f:
        allwords.add(w.strip())

def parse(data, result=None):
    if result is None:
        result = []
    if data in allwords:
        result.append(data)
        yield result[::-1]
    else:
        for i in range(1, len(data)):
            first, last = data[:i], data[i:]
            if last in allwords:
                yield from parse(first, result + [last])

# Test

data = (
    'HELLOHOWAREYOU',
    'THISEXAMPLEWORKSWELL',
    'ISTHEREAFASTWAY',
    'ONE',
    'TWOWORDS',
)

for s in data:
    print(s)
    for u in parse(s):
        print(u)
    print('')    

output

HELLOHOWAREYOU
['HELL', 'OHO', 'WARE', 'YOU']
['HELLO', 'HO', 'WARE', 'YOU']
['HELLO', 'HOW', 'ARE', 'YOU']
['HELL', 'OH', 'OW', 'ARE', 'YOU']
['HELLO', 'HOW', 'A', 'RE', 'YOU']
['HELL', 'OH', 'OW', 'A', 'RE', 'YOU']

THISEXAMPLEWORKSWELL
['THIS', 'EXAMPLE', 'WORK', 'SWELL']
['THIS', 'EX', 'AMPLE', 'WORK', 'SWELL']
['THIS', 'EXAMPLE', 'WORKS', 'WELL']
['THIS', 'EX', 'AMPLE', 'WORKS', 'WELL']

ISTHEREAFASTWAY
['I', 'ST', 'HER', 'EA', 'FAS', 'TWAY']
['IS', 'THERE', 'A', 'FAS', 'TWAY']
['I', 'ST', 'HERE', 'A', 'FAS', 'TWAY']
['IS', 'THE', 'RE', 'A', 'FAS', 'TWAY']
['I', 'ST', 'HE', 'RE', 'A', 'FAS', 'TWAY']
['I', 'ST', 'HER', 'EA', 'FAST', 'WAY']
['IS', 'THERE', 'A', 'FAST', 'WAY']
['I', 'ST', 'HERE', 'A', 'FAST', 'WAY']
['IS', 'THE', 'RE', 'A', 'FAST', 'WAY']
['I', 'ST', 'HE', 'RE', 'A', 'FAST', 'WAY']
['I', 'ST', 'HER', 'EA', 'FA', 'ST', 'WAY']
['IS', 'THERE', 'A', 'FA', 'ST', 'WAY']
['I', 'ST', 'HERE', 'A', 'FA', 'ST', 'WAY']
['IS', 'THE', 'RE', 'A', 'FA', 'ST', 'WAY']
['I', 'ST', 'HE', 'RE', 'A', 'FA', 'ST', 'WAY']

ONE
['ONE']

TWOWORDS
['TWO', 'WORDS']

This code was written for Python 3, but you can make it run on Python 2 by changing

yield from parse(first, result + [last])

to

for seq in parse(first, result + [last]):
    yield seq

BTW, we can sort the output lists by length, i.e., the number of words in each list. This tends to put the more sensible results near the top.

for s in data:
    print(s)
    for u in sorted(parse(s), key=len):
        print(u)
    print('')
like image 65
PM 2Ring Avatar answered Oct 23 '25 00:10

PM 2Ring