
PYTHON: How to pass tokenizer with keyword arguments to scikit's CountVectorizer?

I have a custom tokenizer function with some keyword arguments:

def tokenizer(text, stem=True, lemmatize=False, char_lower_limit=2, char_upper_limit=30):
    do things...
    return tokens

Now, how can I pass this tokenizer, with all its arguments, to CountVectorizer? Nothing I have tried works; this did not work either:

from sklearn.feature_extraction.text import CountVectorizer
args = {"stem": False, "lemmatize": True}
count_vect = CountVectorizer(tokenizer=tokenizer(**args), stop_words='english', strip_accents='ascii', min_df=0, max_df=1., vocabulary=None)

Any help is much appreciated. Thanks in advance.

JRun asked Aug 05 '15 22:08

People also ask

How do you use a CountVectorizer?

The CountVectorizer will select the words/features/terms that occur most frequently. It uses absolute counts, so if you set max_features=3, it will select the 3 most common words in the data. By setting binary=True, the CountVectorizer no longer takes the frequency of a term into account, only its presence or absence.
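A minimal sketch of both parameters on a toy corpus (the corpus and variable names are illustrative, not from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat", "the cat and the dog"]

# max_features=3 keeps only the 3 most frequent terms across the corpus
vect = CountVectorizer(max_features=3)
X = vect.fit_transform(docs)
print(sorted(vect.vocabulary_))  # the 3 most common terms, "the" among them

# binary=True records presence/absence instead of counts
binary_vect = CountVectorizer(binary=True)
B = binary_vect.fit_transform(docs)
print(B.max())  # 1: no entry exceeds 1, even though "the" repeats
```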

What is Ngram_range in CountVectorizer?

CountVectorizer will tokenize the data and split it into chunks called n-grams, of which we can define the length by passing a tuple to the ngram_range argument. For example, 1,1 would give us unigrams or 1-grams such as “whey” and “protein”, while 2,2 would give us bigrams or 2-grams, such as “whey protein”.

What is CountVectorizer in Sklearn?

CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.

What does CountVectorizer fit do?

CountVectorizer converts a collection of text documents to a matrix of token counts. fit learns the vocabulary from the documents; transform then produces a sparse representation of the counts using scipy.sparse.
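A short sketch of the fit/transform split (toy documents, illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]

vect = CountVectorizer()
X = vect.fit(docs).transform(docs)  # fit learns the vocabulary; transform counts

print(type(X))       # a scipy.sparse matrix (CSR)
print(X.toarray())   # one row per document, one column per vocabulary term
```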


1 Answer

The tokenizer argument must be a callable (or None). Writing tokenizer=tokenizer(**args) calls your function at construction time instead of passing the function itself; and since text is missing, that call raises a TypeError. Wrap it in a lambda so CountVectorizer receives a callable that takes the text and forwards your keyword arguments.

You can try this:

count_vect = CountVectorizer(tokenizer=lambda text: tokenizer(text, **args),
                             stop_words='english', strip_accents='ascii',
                             min_df=0, max_df=1., vocabulary=None)
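An equivalent alternative is functools.partial, which binds the keyword arguments and returns a new callable that still expects text. A runnable sketch, with a stand-in tokenizer body (the real stemming/lemmatizing logic is assumed; here it only lower-cases and filters by length):

```python
from functools import partial
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the custom tokenizer from the question
def tokenizer(text, stem=True, lemmatize=False,
              char_lower_limit=2, char_upper_limit=30):
    return [t.lower() for t in text.split()
            if char_lower_limit <= len(t) <= char_upper_limit]

args = {"stem": False, "lemmatize": True}

# partial(tokenizer, **args) is a callable of one argument (text),
# which is exactly what CountVectorizer expects
count_vect = CountVectorizer(tokenizer=partial(tokenizer, **args),
                             stop_words='english', strip_accents='ascii')

X = count_vect.fit_transform(["This tokenizer keeps words of moderate length"])
print(sorted(count_vect.vocabulary_))  # stop words like "this" are dropped
```

partial has a small advantage over a lambda: it pickles cleanly, which matters if the fitted vectorizer is saved with joblib.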
yangjie answered Sep 28 '22 05:09