 

Is there an easy way to generate a probable list of words from an unspaced sentence in Python?

Tags:

python

nlp

I have some text:

 s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"

I'd like to parse this into its individual words. I quickly looked into enchant and nltk, but didn't see anything that looked immediately useful. If I had time to invest in this, I'd look into writing a dynamic program using enchant's ability to check whether a word is English. I would have thought there would already be something out there to do this; am I wrong?
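For reference, here is a minimal sketch of that dynamic-programming/backtracking idea, assuming pyenchant is installed and using enchant.Dict("en_US") for the word check; the segment helper itself is hypothetical, not part of any library:

import enchant
from functools import lru_cache

d = enchant.Dict("en_US")

@lru_cache(maxsize=None)
def segment(s):
    # Return a list of dictionary words covering s, or None if no split exists.
    if not s:
        return []
    # Try longer prefixes first, then backtrack if the remainder can't be split.
    for end in range(len(s), 0, -1):
        prefix = s[:end]
        if d.check(prefix):
            rest = segment(s[end:])
            if rest is not None:
                return [prefix] + rest
    return None

print(segment("imageclassificationmethods"))  # e.g. ['image', 'classification', 'methods']

The lru_cache memoisation keeps the recursion from re-exploring suffixes it has already failed on.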

asked Mar 12 '13 by Erotemic



2 Answers

Greedy approach using trie

Try this using Biopython (pip install biopython):

from Bio import trie
import string


def get_trie(dictfile='/usr/share/dict/american-english'):
    # Build a trie from the system word list, mapping each word to its length.
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                # Bio.trie keys must be plain ASCII strings; skip anything else.
                word = word.encode(encoding='ascii', errors='ignore')
                tr[word] = len(word)
                assert tr.has_key(word), "Missing %s" % word
            except UnicodeDecodeError:
                pass
    return tr


def get_trie_word(tr, s):
    # Greedy longest match: try the whole remaining string first, then
    # progressively shorter prefixes; return (word, remainder).
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s


def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        print word

if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)

Results

>>> if __name__ == '__main__':
...     s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
...     s = s.strip(string.punctuation)
...     s = s.replace(" ", '')
...     s = s.lower()
...     main(s)
... 
image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches

Caveats

There are degenerate cases in English that this will not work for. You need to use backtracking to deal with those, but this should get you started.
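A minimal sketch of that backtracking step, reusing the same trie (segment_backtrack is a hypothetical helper, not part of Biopython):

def segment_backtrack(tr, s):
    # Try prefixes longest-first; if the remainder can't be segmented,
    # fall back to the next shorter prefix instead of giving up.
    if not s:
        return []
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            rest = segment_backtrack(tr, s[end + 1:])
            if rest is not None:
                return [word] + rest
    return None  # no segmentation of s exists in the dictionary

Memoising on s (or on the start index) would keep this from re-exploring suffixes that have already failed.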

Obligatory test

>>> main("expertsexchange")
experts
exchange
answered Oct 07 '22 by hughdbrown


This is the sort of problem that comes up often in Asian NLP. If you have a dictionary, you can use http://code.google.com/p/mini-segmenter/ (disclaimer: I wrote it; hope you don't mind).

Note that the search space can be extremely large, since words in alphabetic English are written with far more characters than syllabic Chinese or Japanese, so there are many more possible split points.
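To put a rough number on that: each of the n-1 gaps between characters is either a break or not, so an n-character string has 2^(n-1) candidate segmentations before any dictionary pruning, e.g.:

n = len("imageclassificationmethods")  # 26 characters
print(2 ** (n - 1))                    # 33554432 candidate segmentations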

answered Oct 07 '22 by alvas