Lemmatization on CountVectorizer doesn't remove Stopwords

I'm trying to add lemmatization to CountVectorizer from scikit-learn, as follows:

from pattern.es import lemma
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer(object):
    def __call__(self, text):
        # Lemmatize every token with pattern.es
        return [lemma(t) for t in word_tokenize(text)]

vectorizer = CountVectorizer(stop_words=stopwords.words('spanish'),
                             tokenizer=LemmaTokenizer())

sentence = ["EVOLUCIÓN de los sucesos y la EXPANSIÓN, ellos juegan y yo les dije lo que hago","hola, qué tal vas?"]

vectorizer.fit_transform(sentence)

This is the resulting vocabulary:

[u',', u'?', u'car', u'decir', u'der', u'evoluci\xf3n', u'expansi\xf3n', u'hacer', u'holar', u'ir', u'jugar', u'lar', u'ler', u'sucesos', u'tal', u'yar']

UPDATE

These are the stopwords that appear in the vocabulary because they have been lemmatized:

u'lar', u'ler', u'der'

It lemmatizes all the words but doesn't remove the stopwords. Any ideas?

asked Mar 07 '23 by ambigus9
1 Answer

That's because lemmatization is done before stop-word removal, so the lemmatized stopwords are no longer found in the stopword list provided by stopwords.words('spanish').
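
You can see this directly (a minimal sketch, assuming pattern.es is installed and behaves as in the question's output):

from pattern.es import lemma
from nltk.corpus import stopwords

spanish_stopwords = stopwords.words('spanish')

print('las' in spanish_stopwords)          # True: 'las' is a Spanish stopword
print(lemma('las'))                        # a lemma such as u'lar' (cf. the output above)
print(lemma('las') in spanish_stopwords)   # False: the lemma is not in the list, so it is kept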

For the complete working order of CountVectorizer, please refer to my other answer here. It's about TfidfVectorizer, but the order is the same: in that answer, step 3 is lemmatization and step 4 is stop-word removal.

So now to remove the stopwords, you have two options:

1) Lemmatize the stopword list itself, and then pass it to the stop_words param in CountVectorizer:

my_stop_words = [lemma(t) for t in stopwords.words('spanish')]
vectorizer = CountVectorizer(stop_words=my_stop_words, 
                             tokenizer=LemmaTokenizer())
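
For example, refitting with this list (a sketch reusing sentence from the question; on newer scikit-learn use get_feature_names_out() instead):

vectorizer.fit_transform(sentence)
print(vectorizer.get_feature_names())   # u'lar', u'ler', u'der' should now be gone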

2) Include the stop-word removal in the LemmaTokenizer itself:

class LemmaTokenizer(object):
    def __call__(self, text):
        return [lemma(t) for t in word_tokenize(text) if t not in stopwords.words('spanish')]
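
A small performance note (my addition, not part of the original answer): stopwords.words('spanish') returns a fresh list and is called for every token above, so for anything beyond toy input it's worth building a set once:

from pattern.es import lemma
from nltk import word_tokenize
from nltk.corpus import stopwords

class LemmaTokenizer(object):
    def __init__(self):
        # Build the stopword set once; set membership tests are O(1)
        self.stop_words = set(stopwords.words('spanish'))

    def __call__(self, text):
        return [lemma(t) for t in word_tokenize(text)
                if t not in self.stop_words]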

Try these and comment if they don't work for you.

answered May 02 '23 by Vivek Kumar