
sklearn partial fit of CountVectorizer

Does CountVectorizer support partial fit?

I would like to train the CountVectorizer using different batches of data.

Donbeo asked Oct 27 '16 15:10


People also ask

What does CountVectorizer fit do?

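CountVectorizer converts a collection of text documents to a matrix of token counts. Calling fit learns the vocabulary (a mapping from token to column index) from the documents; transform then counts those tokens. The implementation produces a sparse representation of the counts using scipy.sparse.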
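
As a small illustration of the above, the sketch below fits a CountVectorizer on two toy sentences (made up for this example) and inspects the learned vocabulary and the sparse count matrix. It assumes default settings.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for this example.
docs = ["the cat sat", "the cat sat on the mat"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns the count matrix in one step.
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index.
vocab = vectorizer.vocabulary_
# Convert to a dense array only for display; real corpora should stay sparse.
dense = counts.toarray()

print(issparse(counts))
print(vocab)
print(dense)
```

Each row of the matrix corresponds to one document and each column to one vocabulary term.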

What is CountVectorizer in Sklearn?

CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.

Does CountVectorizer remove punctuation?

The default tokenization in CountVectorizer keeps only tokens of two or more alphanumeric characters, so punctuation and special characters act as separators and single-character tokens are dropped. If this is not the behavior you want, and you need to keep punctuation and special characters, you can provide a custom tokenizer to CountVectorizer.
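
A minimal sketch of the difference, assuming a hypothetical tokenizer function (keep_symbols, not part of scikit-learn) that keeps every non-space symbol as its own token:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Wait... is C++ a good choice?"]

# Default behaviour: punctuation and one-character tokens disappear.
default = CountVectorizer()
default.fit(docs)
default_terms = sorted(default.vocabulary_)

def keep_symbols(text):
    # Hypothetical tokenizer: runs of word characters, or any single
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

# token_pattern=None makes it explicit that the regex default is unused.
custom = CountVectorizer(tokenizer=keep_symbols, token_pattern=None)
custom.fit(docs)
custom_terms = sorted(custom.vocabulary_)

print(default_terms)
print(custom_terms)
```

Lowercasing still happens before the custom tokenizer runs, because the lowercase option is on by default.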

What is Ngram_range in CountVectorizer?

CountVectorizer will tokenize the data and group the tokens into chunks called n-grams, whose minimum and maximum length we define by passing a tuple to the ngram_range argument. For example, (1, 1) would give us unigrams or 1-grams such as “whey” and “protein”, while (2, 2) would give us bigrams or 2-grams, such as “whey protein”.
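
The effect of ngram_range can be checked on a single toy phrase. This is only a sketch with default settings otherwise; the phrase is invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["whey protein powder"]

# (1, 1): unigrams only; (2, 2): bigrams only; (1, 2): both.
unigrams = sorted(CountVectorizer(ngram_range=(1, 1)).fit(docs).vocabulary_)
bigrams = sorted(CountVectorizer(ngram_range=(2, 2)).fit(docs).vocabulary_)
both = sorted(CountVectorizer(ngram_range=(1, 2)).fit(docs).vocabulary_)

print(unigrams)
print(bigrams)
print(both)
```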


1 Answer

No, it does not support partial fit.

But you can write a simple method to accomplish your goal:

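from sklearn.feature_extraction.text import CountVectorizer

def partial_fit(self, data):
    # Copy the vocabulary learned so far (empty before the first fit).
    vocab = dict(getattr(self, 'vocabulary_', {}))
    # fit() replaces vocabulary_ with the terms of this batch only.
    self.fit(data)
    # Keep the existing column indices and append unseen terms at the end,
    # so earlier term-to-column assignments stay stable.
    for term in sorted(self.vocabulary_):
        if term not in vocab:
            vocab[term] = len(vocab)
    self.vocabulary_ = vocab
    return self

# Attach the method to the class.
CountVectorizer.partial_fit = partial_fit

# l (a stop-word list) and df (a DataFrame with text in column 15)
# are assumed to be defined elsewhere.
vectorizer = CountVectorizer(stop_words=l)
vectorizer.fit(df[15].values[0:100])
vectorizer.partial_fit(df[15].values[100:200])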
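
To try the idea without the original DataFrame, here is a self-contained sketch on toy batches. The helper name extend_vocabulary and the example sentences are hypothetical; this variant keeps existing column indices and appends new terms, which is one possible design rather than built-in scikit-learn behavior.

```python
from sklearn.feature_extraction.text import CountVectorizer

def extend_vocabulary(vectorizer, batch):
    """Grow vectorizer.vocabulary_ with the terms found in batch."""
    # Copy what is known so far (empty before the first batch).
    known = dict(getattr(vectorizer, "vocabulary_", {}))
    # fit() overwrites vocabulary_ with the terms of this batch only.
    vectorizer.fit(batch)
    # Append unseen terms after the existing columns.
    for term in sorted(vectorizer.vocabulary_):
        if term not in known:
            known[term] = len(known)
    vectorizer.vocabulary_ = known
    return vectorizer

batch_1 = ["red apple", "green apple"]
batch_2 = ["green pear", "yellow banana"]

vec = CountVectorizer()
extend_vocabulary(vec, batch_1)
vocab_after_first = dict(vec.vocabulary_)

extend_vocabulary(vec, batch_2)
vocab_after_second = dict(vec.vocabulary_)

# transform() uses the merged vocabulary as a fixed column layout.
row = vec.transform(["yellow apple pear"]).toarray()[0]
print(vocab_after_second)
print(row)
```

Matrices produced before a later batch was added have fewer columns than matrices produced afterwards, so it is safest to transform the data only once the whole vocabulary has been collected. Options such as min_df or max_features are applied to each batch separately in this sketch.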
sajjad answered Sep 28 '22 07:09