I'm trying to add lemmatization to CountVectorizer from scikit-learn, as follows:
from pattern.es import lemma
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Tokenize the text, then lemmatize each token with pattern.es
class LemmaTokenizer(object):
    def __call__(self, text):
        return [lemma(t) for t in word_tokenize(text)]

vectorizer = CountVectorizer(stop_words=stopwords.words('spanish'),
                             tokenizer=LemmaTokenizer())
sentence = ["EVOLUCIÓN de los sucesos y la EXPANSIÓN, ellos juegan y yo les dije lo que hago",
            "hola, qué tal vas?"]
vectorizer.fit_transform(sentence)
This is the output:
[u',', u'?', u'car', u'decir', u'der', u'evoluci\xf3n', u'expansi\xf3n', u'hacer', u'holar', u'ir', u'jugar', u'lar', u'ler', u'sucesos', u'tal', u'yar']
UPDATED
These are the stopwords that appear in the output after being lemmatized:
u'lar', u'ler', u'der'
It lemmatizes all the words but doesn't remove the stopwords. Any ideas?
That's because lemmatization is done before stopword removal, so the lemmatized stopwords are no longer found in the stopwords set provided by stopwords.words('spanish').
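You can verify this directly. Here is a minimal sketch (assuming pattern.es and the NLTK Spanish stopword list behave as in your output) that lists the stopwords whose lemma falls outside the original stopword set, and which therefore survive CountVectorizer's filtering:

from pattern.es import lemma
from nltk.corpus import stopwords

spanish_stops = set(stopwords.words('spanish'))

# Stopwords whose lemmatized form is not in the original stopword set,
# so CountVectorizer can no longer filter them after the tokenizer runs
changed = [(w, lemma(w)) for w in spanish_stops if lemma(w) not in spanish_stops]
print(changed[:10])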
For the complete working order of CountVectorizer, please refer to my other answer here. It's about TfidfVectorizer, but the order is the same. In that answer, step 3 is lemmatization and step 4 is stopword removal.
So now to remove the stopwords, you have two options:
1) Lemmatize the stopwords set itself, and then pass it to the stop_words param in CountVectorizer:
my_stop_words = [lemma(t) for t in stopwords.words('spanish')]
vectorizer = CountVectorizer(stop_words=my_stop_words,
                             tokenizer=LemmaTokenizer())
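To check that it worked, you can inspect the learned vocabulary, reusing sentence from the question (get_feature_names() is the older scikit-learn name; newer versions use get_feature_names_out()):

X = vectorizer.fit_transform(sentence)
# The lemmatized stopwords u'lar', u'ler', u'der' should no longer appear
print(vectorizer.get_feature_names())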
2) Include the stopword removal in the LemmaTokenizer itself:
class LemmaTokenizer(object):
    def __call__(self, text):
        # Filter stopwords before lemmatizing, so they still match the original list
        return [lemma(t) for t in word_tokenize(text)
                if t not in stopwords.words('spanish')]
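Note that stopwords.words('spanish') is re-evaluated for every token in the comprehension above. A minor variant (my suggestion, not required for correctness) caches it as a set once:

class LemmaTokenizer(object):
    def __init__(self):
        # Build the stopword set once instead of on every token
        self.stops = set(stopwords.words('spanish'))
    def __call__(self, text):
        return [lemma(t) for t in word_tokenize(text) if t not in self.stops]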
Try these and comment if they don't work.