I have a custom tokenizer function with some keyword arguments:
def tokenizer(text, stem=True, lemmatize=False, char_lower_limit=2, char_upper_limit=30):
    # do things...
    return tokens
Now, how can I pass this tokenizer with all its arguments to CountVectorizer? Nothing I have tried works; this did not work either:
from sklearn.feature_extraction.text import CountVectorizer
args = {"stem": False, "lemmatize": True}
count_vect = CountVectorizer(tokenizer=tokenizer(**args), stop_words='english', strip_accents='ascii', min_df=0, max_df=1., vocabulary=None)
Any help is much appreciated. Thanks in advance.
For context: CountVectorizer selects the words/features/terms that occur most frequently. The max_features parameter takes an absolute value, so max_features=3 keeps the 3 most common words in the data. Setting binary=True makes CountVectorizer ignore term frequency altogether; every nonzero count is recorded as 1 (presence/absence).
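A quick sketch of both parameters on a made-up toy corpus (get_feature_names_out assumes scikit-learn >= 1.0; older releases expose get_feature_names instead):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["whey protein whey", "protein shake", "whey shake shake"]

# max_features=2 keeps only the 2 most frequent terms in the corpus
cv = CountVectorizer(max_features=2)
cv.fit(docs)
print(cv.get_feature_names_out())              # ['shake' 'whey']

# binary=True records presence/absence instead of raw counts
cv_bin = CountVectorizer(binary=True)
print(cv_bin.fit_transform(docs).toarray())    # entries are 0 or 1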
CountVectorizer tokenizes the data and groups tokens into chunks called n-grams, whose length you set by passing a tuple to the ngram_range argument. For example, ngram_range=(1, 1) yields unigrams (1-grams) such as “whey” and “protein”, while ngram_range=(2, 2) yields bigrams (2-grams) such as “whey protein”.
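For instance, a minimal sketch of bigram extraction on an invented one-document corpus:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(2, 2))   # bigrams only
cv.fit(["whey protein shake"])
print(cv.get_feature_names_out())          # ['protein shake' 'whey protein']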
CountVectorizer is a tool provided by the scikit-learn library in Python. It transforms a given text into a vector based on the frequency (count) of each word that occurs in the text: it converts a collection of text documents to a matrix of token counts, and the implementation produces a sparse representation of the counts using scipy.sparse.
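A minimal end-to-end sketch of that behaviour, using an invented two-document corpus:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
cv = CountVectorizer()
X = cv.fit_transform(docs)

print(type(X))                      # a scipy.sparse matrix
print(cv.get_feature_names_out())   # ['cat' 'mat' 'on' 'sat' 'the']
print(X.toarray())                  # dense view of the token counts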
The tokenizer argument should be a callable or None. tokenizer=tokenizer(**args) calls your function immediately (and fails, since the required text argument is missing) instead of passing the function itself to CountVectorizer.
You can try this:
count_vect = CountVectorizer(tokenizer=lambda text: tokenizer(text, **args),
                             stop_words='english', strip_accents='ascii',
                             min_df=0, max_df=1., vocabulary=None)
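Alternatively, functools.partial from the standard library binds the keyword arguments up front; this is a sketch of the same idea:

from functools import partial
from sklearn.feature_extraction.text import CountVectorizer

# partial(tokenizer, **args) returns a callable equivalent to
# lambda text: tokenizer(text, **args)
count_vect = CountVectorizer(tokenizer=partial(tokenizer, **args),
                             stop_words='english', strip_accents='ascii',
                             min_df=0, max_df=1., vocabulary=None)

One practical difference: a partial built from a module-level function can be pickled (handy if you persist the fitted vectorizer with joblib), while a lambda generally cannot be with the standard pickle module.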