GridsearchCV: can't pickle function error when trying to pass lambda in parameter

Tags: python, scikit-learn, grid-search

I have looked quite extensively on stackoverflow and elsewhere and I can't seem to find an answer to the problem below.

I am trying to modify a parameter of a function that is itself a parameter inside the GridSearchCV function of sklearn. More specifically, I want to change parameters (here preserve_case = False) inside the casual_tokenize function that is passed to the tokenizer parameter of CountVectorizer.

Here's the specific code :

from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from nltk import casual_tokenize

Generating dummy data from 20newsgroup

categories = ['alt.atheism', 'comp.graphics', 'sci.med', 
              'soc.religion.christian']
twenty_train = fetch_20newsgroups(subset='train',
                               categories=categories,
                               shuffle=True,
                               random_state=42)

Creating the classification pipeline.
Note that the tokenizer's parameters can be set by wrapping it in a lambda. I am wondering if there is another way to do it, since this does not work with GridSearchCV.

text_clf = Pipeline([('vect',
                      CountVectorizer(tokenizer=lambda text:
                                     casual_tokenize(text, 
                                     preserve_case=False))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                    ])

text_clf.fit(twenty_train.data, twenty_train.target) # this works fine

I then want to compare the default tokenizer of CountVectorizer with the one in nltk. Note that I am asking the question because I would like to compare more than one tokenizer, each of which has specific parameters that need to be specified.

parameters = {'vect':[CountVectorizer(),
                       CountVectorizer(tokenizer=lambda text:
                                       casual_tokenize(text, 
                                       preserve_case=False))]}

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, cv=5)
gs_clf = gs_clf.fit(twenty_train.data[:100], twenty_train.target[:100])

gs_clf.fit gives the following error:

PicklingError: Can't pickle <function <lambda> at 0x1138c5598>: attribute lookup <lambda> on __main__ failed

So my questions are :
1) Does anybody know how to deal with this issue specifically with GridSearchCV.
2) Is there a better pythonic way of dealing with passing parameters to a function that will also be a parameter ?


asked Jun 05 '18 16:06

Eric F

1 Answer

1) Does anybody know how to deal with this issue specifically with GridSearchCV.

You can use functools.partial instead of lambda:

from functools import partial
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
from joblib import dump

def add(a, b):
    return a + b

plus_one = partial(add, b=1)
plus_one_lambda = lambda a: a + 1
dump(plus_one, 'add.pkl')          # No problem
dump(plus_one_lambda, 'add.pkl')   # Pickling error

For your case:

tokenizer=partial(casual_tokenize, preserve_case=False)
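To check the difference without installing nltk or scikit-learn, here is a minimal sketch that uses only the standard library. It assumes `re.findall` as a stand-in for `casual_tokenize`; the names `partial_tokenizer`, `lambda_tokenizer` and `can_pickle` are made up for this illustration.

```python
import pickle
import re
from functools import partial

# Stand-in tokenizers (assumption: re.findall replaces casual_tokenize here
# so that the sketch needs no third-party packages).
partial_tokenizer = partial(re.findall, r"\w+")
lambda_tokenizer = lambda text: re.findall(r"\w+", text)


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


sample = "Grid search, 5 folds"

# Called directly, both tokenizers give the same result.
assert partial_tokenizer(sample) == lambda_tokenizer(sample)

# Only the partial can be serialized for another process.
partial_ok = can_pickle(partial_tokenizer)
lambda_ok = can_pickle(lambda_tokenizer)

# The partial also survives a full round trip and still tokenizes.
restored = pickle.loads(pickle.dumps(partial_tokenizer))
tokens = restored(sample)
print(partial_ok, lambda_ok, tokens)
```

The same pattern should carry over to `partial(casual_tokenize, preserve_case=False)`, since `casual_tokenize` is an importable module-level function.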

2) Is there a better pythonic way of dealing with passing parameters to a function that will also be a parameter ?

I think lambda and partial are both "pythonic" ways of doing this.

The problem here is that GridSearchCV uses multiprocessing when called with n_jobs=-1, as in your code. It may start several worker processes, so it has to serialize the parameters in one process and pass them to the others, where they are deserialized to recover the same parameters.

GridSearchCV uses joblib for multiprocessing and serialization. In the setup shown here, that serialization relies on pickle, which cannot handle lambda functions, whereas a partial of an importable function pickles without trouble.
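As a rough illustration of why the lambda fails, the sketch below inspects what the standard pickle module stores for a function. It uses `re.findall` only as a convenient importable function; `payload`, `stores_name_only`, `anonymous` and `lambda_name` are names made up for this example.

```python
import pickle
import re

# pickle serializes a function by reference: the payload records the module
# and attribute name so the receiving process can import it again.
payload = pickle.dumps(re.findall)
stores_name_only = b"findall" in payload and len(payload) < 100

# A lambda has no importable name to record, only the placeholder "<lambda>",
# so the lookup by name fails on the sending side.
anonymous = lambda text: text.lower()
lambda_name = anonymous.__qualname__
print(stores_name_only, lambda_name)
```

A partial works because it pickles as a reference to the wrapped function plus the bound arguments, all of which can be looked up or serialized by value.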

answered Oct 02 '22 20:10

phi
