NLTK Stopword List

Question

I have the code below and am trying to apply a stop word list to a list of words. However, the results still contain words such as "a" and "the", which I thought this process would remove. Any ideas about what has gone wrong would be great.

import nltk
from nltk.corpus import stopwords

word_list = open("xxx.y.txt", "r")
filtered_words = [w for w in word_list if not w in stopwords.words('english')]
print filtered_words

Hooked · Accepted Answer

A few things of note.

If you are going to check membership against a list over and over, use a set instead of a list; lookups in a set are much faster.
stopwords.words('english') returns a list of lowercase stop words. Your source quite likely contains capital letters, so words such as "The" do not match.
You aren't reading the file properly: you are iterating over the file object, which yields whole lines, not a list of the words split on whitespace.
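
To make the last two points concrete, here is a minimal sketch. It assumes a tiny hand-written stop set and an in-memory io.StringIO standing in for the real file (both are illustrative, not part of the original code), so it runs without the NLTK corpus installed.

```python
import io

# Stand-in for the real file; io.StringIO iterates line by line, as a file object does.
fake_file = io.StringIO("The cat sat on a mat\nA dog barked\n")

# Tiny illustrative stop set; the real code would use set(stopwords.words('english')).
stops = {"a", "the", "on"}

# Iterating the file yields whole lines (trailing newline included), not words.
lines = [item for item in fake_file]

# A whole line never equals a single stop word, so nothing is filtered out.
unfiltered = [item for item in lines if item not in stops]

# Splitting each line and lowercasing before the lookup gives the intended result.
filtered = [w for line in lines for w in line.split() if w.lower() not in stops]

print(lines)
print(filtered)
```

Here unfiltered ends up identical to lines, which is why the original comprehension appeared to remove nothing.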

Putting it all together:

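import nltk
from nltk.corpus import stopwords

# Build the set once so each membership test is fast.
stops = set(stopwords.words('english'))

# Iterate over the lines of the file, then split each line into words.
with open("xxx.y.txt", "r") as word_list:
    for line in word_list:
        for w in line.split():
            # Compare in lowercase, since the stop list is all lowercase.
            if w.lower() not in stops:
                print(w)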
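
If you want the result as a list, as in the question's filtered_words, the same loop could be wrapped in a small helper. This is only a sketch: filter_stopwords and example_stops are hypothetical names, and the stop set here is a short stand-in for set(stopwords.words('english')).

```python
def filter_stopwords(lines, stops):
    """Return the words from `lines` whose lowercase form is not in `stops`."""
    return [w for line in lines for w in line.split() if w.lower() not in stops]


# Illustrative stop set; swap in set(stopwords.words('english')) for real use.
example_stops = {"a", "the", "of"}

# Any iterable of lines works here, including an open file object.
filtered_words = filter_stopwords(["The end of a story\n"], example_stops)
print(filtered_words)
```

Note that split() only separates on whitespace, so a token with attached punctuation such as "the," would still pass through unless the punctuation is stripped first.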

Tags:

python

nltk

stop-words

saph_top
