 

Is there an easy way to generate a probable list of words from an unspaced sentence in Python?

Tags:

python

nlp

I have some text:

 s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"

I'd like to parse this into its individual words. I quickly looked into enchant and nltk, but didn't see anything that looked immediately useful. If I had time to invest in this, I'd look into writing a dynamic program using enchant's ability to check whether a word is English. I would have thought there would already be something out there to do this; am I wrong?
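For reference, here is a minimal sketch of that dynamic-programming/backtracking idea, assuming pyenchant is installed and using enchant.Dict("en_US") for the word check; the segment helper itself is hypothetical, not part of any library:

import enchant
from functools import lru_cache

d = enchant.Dict("en_US")

@lru_cache(maxsize=None)
def segment(s):
    # Return a list of dictionary words covering s, or None if no split exists.
    if not s:
        return []
    # Try longer prefixes first, then backtrack if the remainder can't be split.
    for end in range(len(s), 0, -1):
        prefix = s[:end]
        if d.check(prefix):
            rest = segment(s[end:])
            if rest is not None:
                return [prefix] + rest
    return None

print(segment("imageclassificationmethods"))  # e.g. ['image', 'classification', 'methods']

The lru_cache memoisation keeps the recursion from re-exploring suffixes it has already failed on.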

asked Mar 12 '13 by Erotemic



2 Answers

Greedy approach using trie

Try this using Biopython (pip install biopython):

from Bio import trie
import string


def get_trie(dictfile='/usr/share/dict/american-english'):
    # Build a trie from the system word list, mapping each word to its length.
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                # Bio.trie keys must be plain ASCII strings; skip anything else.
                word = word.encode(encoding='ascii', errors='ignore')
                tr[word] = len(word)
                assert tr.has_key(word), "Missing %s" % word
            except UnicodeDecodeError:
                pass
    return tr


def get_trie_word(tr, s):
    # Greedy longest match: try the whole remaining string first, then
    # progressively shorter prefixes; return (word, remainder).
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s


def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        print word

if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)

Results

>>> if __name__ == '__main__':
...     s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
...     s = s.strip(string.punctuation)
...     s = s.replace(" ", '')
...     s = s.lower()
...     main(s)
... 
image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches

Caveats

There are degenerate cases in English that this will not work for. You need to use backtracking to deal with those, but this should get you started.
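A minimal sketch of that backtracking step, reusing the same trie (segment_backtrack is a hypothetical helper, not part of Biopython):

def segment_backtrack(tr, s):
    # Try prefixes longest-first; if the remainder can't be segmented,
    # fall back to the next shorter prefix instead of giving up.
    if not s:
        return []
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            rest = segment_backtrack(tr, s[end + 1:])
            if rest is not None:
                return [word] + rest
    return None  # no segmentation of s exists in the dictionary

Memoising on s (or on the start index) would keep this from re-exploring suffixes that have already failed.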

Obligatory test

>>> main("expertsexchange")
experts
exchange
answered Oct 07 '22 by hughdbrown


This is the sort of problem that comes up often in Asian NLP. If you have a dictionary, you can use http://code.google.com/p/mini-segmenter/ (disclaimer: I wrote it; hope you don't mind).

Note that the search space can be extremely large, since words in alphabetic English are written with far more characters than syllabic Chinese or Japanese, so there are many more possible split points.
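To put a rough number on that: each of the n-1 gaps between characters is either a break or not, so an n-character string has 2^(n-1) candidate segmentations before any dictionary pruning, e.g.:

n = len("imageclassificationmethods")  # 26 characters
print(2 ** (n - 1))                    # 33554432 candidate segmentations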

answered Oct 07 '22 by alvas