
NLTK Stopword List

I have the code below and I am trying to apply a stop word list to a list of words. However, the results still contain words such as "a" and "the", which I thought this process would remove. Any ideas about what has gone wrong would be great.

import nltk
from nltk.corpus import stopwords

word_list = open("xxx.y.txt", "r")
filtered_words = [w for w in word_list if not w in stopwords.words('english')]
print filtered_words
saph_top asked Mar 31 '14 13:03


1 Answer

A few things of note.

  • If you are going to check membership over and over, use a set instead of a list. A set lookup takes constant time on average, while a list lookup scans the elements one by one. Your comprehension also calls stopwords.words('english') again for every item it checks.

  • stopwords.words('english') returns a list of lowercase stop words. Your source quite likely contains capital letters, so a word such as "The" does not match.

  • You aren't reading the file properly: you are iterating over the file object, which yields whole lines, not over a list of words split on whitespace.
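
To see why the original comprehension filters nothing, here is a minimal sketch. It assumes io.StringIO as a stand-in for the opened file, and both the sample text and the tiny stop set are made up for illustration; the real set would come from stopwords.words('english').

```python
import io

# Stand-in for open("xxx.y.txt", "r"); the sample text is invented.
fake_file = io.StringIO("The cat sat on a mat\nA dog barked\n")

# Hand-written stand-in for set(stopwords.words('english')).
stops = {"a", "the", "on"}

# Iterating over a file object yields whole lines, newline included.
lines = list(fake_file)

# A whole line never equals a single stop word, so nothing is removed.
unfiltered = [w for w in lines if w not in stops]

# Splitting each line into words and lowercasing before the check works.
filtered = [w for line in lines for w in line.split() if w.lower() not in stops]
```

Here unfiltered is identical to lines, while filtered keeps only the words that are not in the stop set.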

Putting it all together:

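from nltk.corpus import stopwords

# Build the set once; membership tests against a set are fast.
stops = set(stopwords.words('english'))

# Iterating over the file yields lines; the with block closes the file afterwards.
with open("xxx.y.txt", "r") as word_list:
    for line in word_list:
        for w in line.split():
            # Compare in lowercase because the stop word list is lowercase.
            if w.lower() not in stops:
                print(w)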
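
If you want a filtered list, as in the question, rather than printed output, the same logic fits in a comprehension. This is only a sketch: remove_stopwords is a hypothetical helper name, and the sample lines and stop set are invented stand-ins for the file contents and for set(stopwords.words('english')).

```python
def remove_stopwords(lines, stops):
    """Return the words in `lines` whose lowercase form is not in `stops`."""
    return [w for line in lines for w in line.split() if w.lower() not in stops]

# Hypothetical sample input; with a real file, pass the open file object instead.
sample_lines = ["The quick brown fox\n", "jumps over a lazy dog\n"]

# Hand-written stand-in for set(stopwords.words('english')).
sample_stops = {"a", "the", "over"}

filtered_words = remove_stopwords(sample_lines, sample_stops)
print(filtered_words)
```

Note that line.split() only splits on whitespace, so a token with punctuation attached, such as "the,", would not match a stop word and would be kept.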
Hooked answered Oct 17 '22 21:10