I'm trying to add lemmatization to CountVectorizer from scikit-learn, as follows:
from pattern.es import lemma
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Tokenize the text, then lemmatize each token with pattern.es
class LemmaTokenizer(object):
    def __call__(self, text):
        return [lemma(t) for t in word_tokenize(text)]

vectorizer = CountVectorizer(stop_words=stopwords.words('spanish'),
                             tokenizer=LemmaTokenizer())
sentence = ["EVOLUCIÓN de los sucesos y la EXPANSIÓN, ellos juegan y yo les dije lo que hago",
            "hola, qué tal vas?"]
vectorizer.fit_transform(sentence)
This is the output:
[u',', u'?', u'car', u'decir', u'der', u'evoluci\xf3n', u'expansi\xf3n', u'hacer', u'holar', u'ir', u'jugar', u'lar', u'ler', u'sucesos', u'tal', u'yar']
UPDATED
These are the stopwords that appear in the output after being lemmatized:
u'lar', u'ler', u'der'
It lemmatizes all the words but doesn't remove the stopwords. Any ideas?
That's because lemmatization is done before stopword removal, so the lemmatized stopwords are no longer found in the stopwords set provided by stopwords.words('spanish').
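You can verify this directly. Here is a minimal sketch (assuming pattern.es and the NLTK Spanish stopword list behave as in your output) that lists the stopwords whose lemma falls outside the original stopword set, and which therefore survive CountVectorizer's filtering:

from pattern.es import lemma
from nltk.corpus import stopwords

spanish_stops = set(stopwords.words('spanish'))

# Stopwords whose lemmatized form is not in the original stopword set,
# so CountVectorizer can no longer filter them after the tokenizer runs
changed = [(w, lemma(w)) for w in spanish_stops if lemma(w) not in spanish_stops]
print(changed[:10])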
For the complete working order of CountVectorizer, please refer to my other answer here. It's about TfidfVectorizer, but the order is the same. In that answer, step 3 is lemmatization and step 4 is stopword removal.
So now to remove the stopwords, you have two options:
1) Lemmatize the stopwords set itself, and then pass it to the stop_words param in CountVectorizer:
my_stop_words = [lemma(t) for t in stopwords.words('spanish')]
vectorizer = CountVectorizer(stop_words=my_stop_words,
                             tokenizer=LemmaTokenizer())
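To check that it worked, you can inspect the learned vocabulary, reusing sentence from the question (get_feature_names() is the older scikit-learn name; newer versions use get_feature_names_out()):

X = vectorizer.fit_transform(sentence)
# The lemmatized stopwords u'lar', u'ler', u'der' should no longer appear
print(vectorizer.get_feature_names())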
2) Include the stopword removal in the LemmaTokenizer itself:
class LemmaTokenizer(object):
    def __call__(self, text):
        # Filter stopwords before lemmatizing, so they still match the original list
        return [lemma(t) for t in word_tokenize(text)
                if t not in stopwords.words('spanish')]
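Note that stopwords.words('spanish') is re-evaluated for every token in the comprehension above. A minor variant (my suggestion, not required for correctness) caches it as a set once:

class LemmaTokenizer(object):
    def __init__(self):
        # Build the stopword set once instead of on every token
        self.stops = set(stopwords.words('spanish'))
    def __call__(self, text):
        return [lemma(t) for t in word_tokenize(text) if t not in self.stops]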
Try these and comment if they don't work.