I have some text:
s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
I'd like to parse this into its individual words. I took a quick look at enchant and nltk, but didn't see anything that looked immediately useful. If I had time to invest in this, I'd look into writing a dynamic program using enchant's ability to check whether a word is English or not. I would have thought there'd be something to do this online; am I wrong?
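For reference, here is a minimal sketch of that dynamic-programming idea with pyenchant. It assumes the "en_US" dictionary is available; the segment() helper and the single-letter filter are my own illustration, not an existing API.

import enchant

def segment(s, checker=None):
    # Dynamic programming: best[i] holds a segmentation of s[:i] into
    # dictionary words (as a list), or None if none was found.
    d = checker or enchant.Dict("en_US")
    best = [None] * (len(s) + 1)
    best[0] = []
    for i in range(1, len(s) + 1):
        # Smallest j first, so the longest word ending at i wins.
        for j in range(i):
            piece = s[j:i]
            # Skip single letters other than 'a'/'i' to avoid spurious matches.
            if len(piece) == 1 and piece not in ("a", "i"):
                continue
            if best[j] is not None and d.check(piece):
                best[i] = best[j] + [piece]
                break
    return best[len(s)]

print(segment("imageclassificationmethods"))
# e.g. ['image', 'classification', 'methods'] (the exact result depends on the dictionary)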
The simplest approach Python provides for splitting a sentence into words is the split() method, which splits a string into a list where each word is a list item.
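For example, with the string from the question (split() with no arguments separates on whitespace only, so the run-together words stay joined):

s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
print(s.split())
# ['Imageclassificationmethodscan', 'beroughlydividedinto', 'two', 'broad',
#  'families', 'of', 'approaches:']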
Try this using Biopython (pip install biopython):
from Bio import trie
import string


def get_trie(dictfile='/usr/share/dict/american-english'):
    # Build a trie of dictionary words from the system word list.
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                word = word.encode(encoding='ascii', errors='ignore')
                tr[word] = len(word)
                assert tr.has_key(word), "Missing %s" % word
            except UnicodeDecodeError:
                pass
    return tr


def get_trie_word(tr, s):
    # Greedy longest match: return the longest dictionary prefix of s
    # and whatever is left over.
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s


def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        print word


if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)
>>> if __name__ == '__main__':
...     s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
...     s = s.strip(string.punctuation)
...     s = s.replace(" ", '')
...     s = s.lower()
...     main(s)
...
image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches
There are degenerate cases in English that this greedy approach will not handle; you need backtracking to deal with those (see the sketch after the next example), but this should get you started.
>>> main("expertsexchange")
experts
exchange
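A minimal sketch of that backtracking idea, reusing the get_trie() trie and its has_key interface from the code above (segment_backtrack is my own illustrative helper, not part of Biopython):

def segment_backtrack(tr, s):
    # Return a list of dictionary words covering all of s, or None if
    # no such split exists.
    if not s:
        return []
    # Try the longest prefix first, but back off when the remainder
    # cannot be segmented.
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            rest = segment_backtrack(tr, s[end + 1:])
            if rest is not None:
                return [word] + rest
    return None

print(segment_backtrack(get_trie(), "expertsexchange"))

Without memoization this can be exponential in the worst case, but it handles inputs where the greedy longest match paints itself into a corner.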
This is the sort of problem that comes up often in Asian NLP (word segmentation). If you have a dictionary, then you can use this: http://code.google.com/p/mini-segmenter/ (disclaimer: I wrote it, hope you don't mind).
Note that the search space can be extremely large, because words in alphabetic English are typically much longer (in characters) than syllables in Chinese/Japanese.