 

Getting rid of stop words and document tokenization using NLTK

I’m having difficulty removing stop words from, and tokenizing, a .txt file using nltk. I keep getting the following AttributeError: 'list' object has no attribute 'lower'.

I just can’t figure out what I’m doing wrong, although this is my first time doing something like this. Below are my lines of code. I’ll appreciate any suggestions, thanks.

    import nltk
    from nltk.corpus import stopwords

    s = open("C:\zircon\sinbo1.txt").read()
    tokens = nltk.word_tokenize(s)

    def cleanupDoc(s):
        stopset = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(s)
        # this line raises the AttributeError
        cleanup = [token.lower()for token in tokens.lower() not in stopset and  len(token)>2]
        return cleanup

    cleanupDoc(s)
asked Jun 30 '13 by Tiger1

1 Answer

You can use the stopwords lists from NLTK, see How to remove stop words using nltk or python.

Most likely you will also want to strip punctuation; for that you can use string.punctuation, see http://docs.python.org/2/library/string.html:

>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> import string
>>> sent = "this is a foo bar, bar black sheep."
>>> stop = set(stopwords.words('english') + list(string.punctuation))
>>> [i for i in word_tokenize(sent.lower()) if i not in stop]
['foo', 'bar', 'bar', 'black', 'sheep']
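As for the AttributeError in the question: .lower() is a string method, so it must be called on each token inside the comprehension rather than on the token list, and the filter condition needs an explicit if clause. A corrected sketch of cleanupDoc (using a toy stop set here so the snippet runs without downloading NLTK data; substitute set(stopwords.words('english')) and nltk.word_tokenize in real use):

```python
def cleanupDoc(s, stopset):
    # Naive whitespace tokenization; swap in nltk.word_tokenize
    # if the NLTK 'punkt' data is available.
    tokens = s.split()
    # Lower-case each token individually, and filter with an `if` clause.
    return [t.lower() for t in tokens
            if t.lower() not in stopset and len(t) > 2]

# Stand-in for stopwords.words('english')
toy_stopset = {'this', 'is', 'a', 'the'}
print(cleanupDoc("This is a foo bar sentence", toy_stopset))
# ['foo', 'bar', 'sentence']
```

The key change is the comprehension shape: `[expr for item in iterable if condition]`, not `[expr for item in iterable condition]`.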
answered Sep 29 '22 by alvas