 

Getting rid of stop words and document tokenization using NLTK

I’m having difficulty removing stop words from, and tokenizing, a .txt file using nltk. I keep getting the following AttributeError: 'list' object has no attribute 'lower'.

I just can’t figure out what I’m doing wrong, although this is my first time doing something like this. Below are my lines of code. I’ll appreciate any suggestions, thanks.

    import nltk
    from nltk.corpus import stopwords

    s = open("C:\zircon\sinbo1.txt").read()
    tokens = nltk.word_tokenize(s)

    def cleanupDoc(s):
        stopset = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(s)
        # this line raises the AttributeError
        cleanup = [token.lower()for token in tokens.lower() not in stopset and  len(token)>2]
        return cleanup

    cleanupDoc(s)
asked Jun 30 '13 by Tiger1

1 Answer

You can use the stopwords lists from NLTK, see How to remove stop words using nltk or python.

Most likely you will also want to strip punctuation; for that you can use string.punctuation, see http://docs.python.org/2/library/string.html:

>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> import string
>>> sent = "this is a foo bar, bar black sheep."
>>> stop = set(stopwords.words('english') + list(string.punctuation))
>>> [i for i in word_tokenize(sent.lower()) if i not in stop]
['foo', 'bar', 'bar', 'black', 'sheep']
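As for the AttributeError in the question: .lower() is a string method, so it must be called on each token inside the comprehension rather than on the token list, and the filter condition needs an explicit if clause. A corrected sketch of cleanupDoc (using a toy stop set here so the snippet runs without downloading NLTK data; substitute set(stopwords.words('english')) and nltk.word_tokenize in real use):

```python
def cleanupDoc(s, stopset):
    # Naive whitespace tokenization; swap in nltk.word_tokenize
    # if the NLTK 'punkt' data is available.
    tokens = s.split()
    # Lower-case each token individually, and filter with an `if` clause.
    return [t.lower() for t in tokens
            if t.lower() not in stopset and len(t) > 2]

# Stand-in for stopwords.words('english')
toy_stopset = {'this', 'is', 'a', 'the'}
print(cleanupDoc("This is a foo bar sentence", toy_stopset))
# ['foo', 'bar', 'sentence']
```

The key change is the comprehension shape: `[expr for item in iterable if condition]`, not `[expr for item in iterable condition]`.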
answered Sep 29 '22 by alvas