
add stemming support to CountVectorizer (sklearn)

I'm trying to add stemming to my pipeline in NLP with sklearn.

from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

stop = stopwords.words('french')
stemmer = FrenchStemmer()

class StemmedCountVectorizer(CountVectorizer):
    def __init__(self, stemmer):
        super(StemmedCountVectorizer, self).__init__()
        self.stemmer = stemmer

    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (self.stemmer.stem(w) for w in analyzer(doc))

stem_vectorizer = StemmedCountVectorizer(stemmer)
text_clf = Pipeline([('vect', stem_vectorizer),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SVC(kernel='linear', C=1))])

This pipeline works when I use sklearn's plain CountVectorizer. It also works if I create the features manually, like this:

vectorizer = StemmedCountVectorizer(stemmer)
X_counts = vectorizer.fit_transform(X)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)

EDIT:

If I try this pipeline in my IPython Notebook, it displays [*] and nothing happens. When I look at my terminal, it gives this error:

Process PoolWorker-12:
Traceback (most recent call last):
  File "C:\Anaconda2\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Anaconda2\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Anaconda2\lib\multiprocessing\pool.py", line 102, in worker
    task = get()
  File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\pool.py", line 360, in get
    return recv()
AttributeError: 'module' object has no attribute 'StemmedCountVectorizer'

Example

Here is the complete example

from sklearn.pipeline import Pipeline
from sklearn import grid_search
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemming(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

X = ['le chat est beau', 'le ciel est nuageux', 'les gens sont gentils',
     'Paris est magique', 'Marseille est tragique', 'JCVD est fou']
Y = [1, 0, 1, 1, 0, 0]

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SVC())])
parameters = {'vect__analyzer': ['word', stemming]}

gs_clf = grid_search.GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf.fit(X, Y)

If I remove stemming from the parameters it works; otherwise it doesn't.

UPDATE:

The problem seems to be in the parallelization, because the problem disappears when I remove n_jobs=-1.
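The traceback fits that theory: on Windows, joblib's worker processes re-import the notebook's __main__ module and cannot find StemmedCountVectorizer there, because interactively defined classes and functions are pickled by reference (module + name), not by value. A minimal sketch of the constraint (the stemming_stub helper is hypothetical, purely for illustration): a named module-level function survives a pickle round-trip, while a lambda does not.

```python
import pickle

def stemming_stub(doc):
    # Stands in for a real analyzer. A named, module-level function is
    # pickled by reference (module name + function name), so worker
    # processes can re-import it.
    return doc.lower().split()

restored = pickle.loads(pickle.dumps(stemming_stub))
print(restored('Le Chat'))  # ['le', 'chat']

# A lambda has no importable name, so the stdlib pickle rejects it:
lambda_pickles = True
try:
    pickle.dumps(lambda doc: doc.lower().split())
except Exception:
    lambda_pickles = False
print(lambda_pickles)  # False
```

This is why moving the class or analyzer function into a separate importable .py file (or avoiding lambdas, as in the answers below) typically resolves the n_jobs=-1 failure.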

dooms, asked Mar 23 '16


People also ask

Does TfIdfVectorizer do Stemming?

Not by itself. But you can pass TfidfVectorizer your own function that performs custom tokenization and stemming, while still using scikit-learn's built-in stop word removal rather than NLTK's.
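For instance (a sketch; the tokenize_and_stem helper is my own, using NLTK's English Snowball stemmer and naive whitespace tokenization):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def tokenize_and_stem(doc):
    # Naive whitespace tokenization followed by stemming.
    return [stemmer.stem(t) for t in doc.lower().split()]

tfidf = TfidfVectorizer(tokenizer=tokenize_and_stem)
X = tfidf.fit_transform(['running runners run', 'walking walkers walk'])
print(sorted(tfidf.vocabulary_))  # ['run', 'runner', 'walk', 'walker']
```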

What is Max features in CountVectorizer?

The CountVectorizer selects the words/features/terms that occur most frequently. It works on absolute counts, so if you set max_features=3 it keeps the 3 most common words in the data. By setting binary=True, the CountVectorizer no longer takes the frequency of a term into account, only its presence.
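A quick illustration on a toy corpus of my own:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['le chat dort', 'le chat mange', 'le chat court']

# max_features=2 keeps only the 2 most frequent terms ('le' and 'chat').
top2 = CountVectorizer(max_features=2)
top2.fit(docs)
print(sorted(top2.vocabulary_))  # ['chat', 'le']

# binary=True records presence (0/1) instead of raw counts.
binary_vec = CountVectorizer(binary=True)
X = binary_vec.fit_transform(['chat chat chat'])
print(X.toarray())  # [[1]]
```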

Is CountVectorizer same as bag of words?

Essentially, yes. CountVectorizer creates a matrix of documents and token counts (a bag of terms/tokens), which is why its output is also known as a document-term matrix (DTM).
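For example, a two-document corpus produces a 2 × |vocabulary| matrix of counts, with one column per term (alphabetical by default):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['le ciel est bleu', 'le chat est beau']
vec = CountVectorizer()
dtm = vec.fit_transform(corpus)  # sparse document-term matrix

print(sorted(vec.vocabulary_))  # ['beau', 'bleu', 'chat', 'ciel', 'est', 'le']
print(dtm.toarray())
# [[0 1 0 1 1 1]
#  [1 0 1 0 1 1]]
```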

What does Sklearn CountVectorizer do?

Scikit-learn's CountVectorizer converts a collection of text documents into a matrix of term/token counts. It also enables pre-processing of the text data prior to generating the vector representation, which makes it a highly flexible feature representation module for text.
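The pre-processing hooks can be seen with, for example, accent stripping combined with the default lowercasing (a small sketch):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(strip_accents='unicode')  # lowercase=True is the default
X = vec.fit_transform(['Élève élève'])
print(list(vec.vocabulary_))  # ['eleve'] -- both forms normalize to one term
print(X.toarray())            # [[2]]
```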


2 Answers

You can pass a callable as the analyzer argument to the CountVectorizer constructor to provide a custom analyzer. This works for me:

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

stem_vectorizer = CountVectorizer(analyzer=stemmed_words)
print(stem_vectorizer.fit_transform(['Tu marches dans la rue']))
print(stem_vectorizer.get_feature_names())

Prints out:

  (0, 4)    1
  (0, 2)    1
  (0, 0)    1
  (0, 1)    1
  (0, 3)    1
[u'dan', u'la', u'march', u'ru', u'tu']
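A side benefit, relevant to the n_jobs problem in the question: because stemmed_words is a named module-level function, the resulting vectorizer survives a pickle round-trip, which is what joblib's worker processes need. A sketch repeating the setup above:

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

vec = CountVectorizer(analyzer=stemmed_words)
restored = pickle.loads(pickle.dumps(vec))  # round-trips cleanly
X = restored.fit_transform(['Tu marches dans la rue'])
print(X.sum())  # 5 (one count per stemmed token)
```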
joeln, answered Oct 02 '22


I know I am a little late in posting my answer. But here it is, in case someone still needs help.

The following is the cleanest approach to adding a language stemmer to CountVectorizer: override build_analyzer().

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import nltk.stem

french_stemmer = nltk.stem.SnowballStemmer('french')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: [french_stemmer.stem(w) for w in analyzer(doc)]

# Note: stop_words only accepts 'english' or an explicit list, so pass
# NLTK's French list (requires nltk.download('stopwords')) instead of 'french'.
vectorizer_s = StemmedCountVectorizer(min_df=3, analyzer="word",
                                      stop_words=stopwords.words('french'))

You can then call the fit and transform methods of the CountVectorizer class on your vectorizer_s object.
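For example, on a one-sentence corpus of my own (I drop min_df and the stop word list here so the toy vocabulary survives):

```python
from sklearn.feature_extraction.text import CountVectorizer
import nltk.stem

french_stemmer = nltk.stem.SnowballStemmer('french')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: [french_stemmer.stem(w) for w in analyzer(doc)]

vec = StemmedCountVectorizer()
X = vec.fit_transform(['Tu marches dans la rue'])
print(sorted(vec.vocabulary_))  # ['dan', 'la', 'march', 'ru', 'tu']
print(X.sum())                  # 5
```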

Parth Gupta, answered Oct 02 '22