
NLTK stopword removal issue

Tags:

python

nltk

I'm trying to do document classification, as described in NLTK Chapter 6, and I'm having trouble removing stopwords. When I add

all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english'))

it returns

Traceback (most recent call last):
  File "fiction.py", line 8, in <module>
    word_features = all_words.keys()[:100]
AttributeError: 'generator' object has no attribute 'keys'

I'm guessing that the stopword code changed the type of object used for 'all_words', rendering the .keys() method useless. How can I remove stopwords before calling keys() without changing the object's type? Full code below:

import nltk 
from nltk.corpus import PlaintextCorpusReader

corpus_root = './nltk_data/corpora/fiction'
fiction = PlaintextCorpusReader(corpus_root, '.*')
all_words=nltk.FreqDist(w.lower() for w in fiction.words())
all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english'))
word_features = all_words.keys()[:100]

def document_features(document): # [_document-classify-extractor]
    document_words = set(document) # [_document-classify-set]
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

print document_features(fiction.words('fic/11.txt'))
user3128184 asked Oct 03 '22 01:10



1 Answer

I would do this by avoiding adding them to the FreqDist instance in the first place:

all_words=nltk.FreqDist(w.lower() for w in fiction.words() if w.lower() not in nltk.corpus.stopwords.words('english'))

Depending on the size of your corpus, you'd probably get a performance boost by creating a set of the stopwords before doing that, since the line above rebuilds the stopword list and scans it for every word:

stopword_set = frozenset(nltk.corpus.stopwords.words('english'))
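
Roughly, the set would then be used in the filter like this. This is only a sketch: so that it runs without any NLTK data installed, I've assumed a tiny made-up stopword list and word list, and collections.Counter stands in for nltk.FreqDist (in NLTK 3, FreqDist subclasses Counter). In your script you would swap in nltk.corpus.stopwords.words('english'), fiction.words() and nltk.FreqDist.

```python
from collections import Counter

# Hypothetical stand-ins so the sketch runs without NLTK data.
# In the real script: nltk.corpus.stopwords.words('english') and fiction.words().
STOPWORDS = ["the", "a", "of", "and"]
corpus_words = ["The", "cat", "and", "the", "hat", "of", "a", "cat"]

# Build the set once; each membership test is then a hash lookup
# instead of a scan over the whole stopword list.
stopword_set = frozenset(STOPWORDS)

# Counter stands in for nltk.FreqDist here (assumption for illustration).
all_words = Counter(
    w.lower() for w in corpus_words if w.lower() not in stopword_set
)

# keys() cannot be sliced on Python 3; most_common() returns
# (word, count) pairs ordered by descending count.
word_features = [word for word, count in all_words.most_common(100)]
```

Because the stopwords never enter the counter, all_words stays a mapping and the later feature-extraction code does not need to change.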

If that's not suitable for your situation, it looks like you can take advantage of the fact that FreqDist inherits from dict:

for stopword in nltk.corpus.stopwords.words('english'):
    if stopword in all_words:
        del all_words[stopword]
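
To show why the original line broke and why deleting in place does not, here is a small sketch under the same assumptions as before: a made-up stopword list, and collections.Counter standing in for nltk.FreqDist. A generator expression always produces a generator, whatever it iterates over, whereas del only removes entries and leaves the mapping type alone.

```python
from collections import Counter

# Hypothetical stand-in for nltk.corpus.stopwords.words('english').
STOPWORDS = ["the", "a", "of", "and"]

# Counter stands in for nltk.FreqDist (assumption for illustration).
all_words = Counter(["the", "cat", "and", "the", "hat"])

# The generator expression yields a new generator object, not a mapping,
# so it has no keys() method. This is the cause of the AttributeError.
filtered = (w for w in all_words if w not in STOPWORDS)
generator_has_keys = hasattr(filtered, "keys")

# Deleting entries in place keeps the original mapping type.
for stopword in STOPWORDS:
    if stopword in all_words:
        del all_words[stopword]

remaining = sorted(all_words.keys())
```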
Peter DeGlopper answered Oct 13 '22 10:10
