
GridsearchCV: can't pickle function error when trying to pass lambda in parameter

I have looked quite extensively on stackoverflow and elsewhere and I can't seem to find an answer to the problem below.

I am trying to modify a parameter of a function that is itself a parameter inside the GridSearchCV function of sklearn. More specifically, I want to change parameters (here preserve_case = False) inside the casual_tokenize function that is passed to the tokenizer parameter of CountVectorizer.

Here's the specific code :

from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from nltk import casual_tokenize

Generating dummy data from 20newsgroup

categories = ['alt.atheism', 'comp.graphics', 'sci.med', 
              'soc.religion.christian']
twenty_train = fetch_20newsgroups(subset='train',
                               categories=categories,
                               shuffle=True,
                               random_state=42)

Creating classification pipeline.
Note that the tokenizer can be modified using a lambda. I am wondering if there is another way to do it, since the lambda does not work with GridSearchCV.

text_clf = Pipeline([('vect',
                      CountVectorizer(tokenizer=lambda text:
                                     casual_tokenize(text, 
                                     preserve_case=False))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                    ])

text_clf.fit(twenty_train.data, twenty_train.target) # this works fine

I then want to compare the default tokenizer of CountVectorizer with the one in nltk. Note that I am asking the question because I would like to compare more than one tokenizer, each with specific parameters that need to be specified.

parameters = {'vect':[CountVectorizer(),
                       CountVectorizer(tokenizer=lambda text:
                                       casual_tokenize(text, 
                                       preserve_case=False))]}

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, cv=5)
gs_clf = gs_clf.fit(twenty_train.data[:100], twenty_train.target[:100])

gs_clf.fit gives the following error: PicklingError: Can't pickle <function <lambda> at 0x1138c5598>: attribute lookup <lambda> on __main__ failed

So my questions are:
1) Does anybody know how to deal with this issue specifically with GridSearchCV?
2) Is there a better pythonic way of passing parameters to a function that will itself be a parameter?

Eric F asked Jun 05 '18 16:06



1 Answer

1) Does anybody know how to deal with this issue specifically with GridSearchCV?

You can use functools.partial instead of a lambda:

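from functools import partial
from joblib import dump  # sklearn.externals.joblib was removed in newer scikit-learn releases

def add(a, b):
    return a + b

plus_one = partial(add, b=1)       # wraps a named function, so it can be pickled
plus_one_lambda = lambda a: a + 1  # anonymous, so pickle cannot look it up by name
dump(plus_one, 'add.pkl')          # No problem
dump(plus_one_lambda, 'add.pkl')   # Pickling error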

For your case:

tokenizer=partial(casual_tokenize, preserve_case=False)
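To check the difference without installing nltk or scikit-learn, here is a minimal sketch that uses only the standard library. The tokenizer built from re.findall is a stand-in I am assuming for illustration (it is not casual_tokenize), and can_pickle is a hypothetical helper name.

```python
import pickle
import re
from functools import partial

# Stand-in tokenizer (an assumption for illustration; in the question this
# would be nltk's casual_tokenize with preserve_case=False).
word_tokenizer = partial(re.findall, r"\w+")

# The same behaviour written as a lambda.
lambda_tokenizer = lambda text: re.findall(r"\w+", text)


def can_pickle(obj):
    """Hypothetical helper: True if pickle.dumps accepts obj, else False."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


print(can_pickle(word_tokenizer))    # True
print(can_pickle(lambda_tokenizer))  # False

# The partial still works after a round trip through pickle.
restored = pickle.loads(pickle.dumps(word_tokenizer))
print(restored("Hello, pickled world"))  # ['Hello', 'pickled', 'world']
```

A partial built this way can be placed in the parameter grid in the same position where the lambda was.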

2) Is there a better pythonic way of passing parameters to a function that will itself be a parameter?

I think lambda and partial are both "pythonic" ways of doing this.

The problem here is that GridSearchCV uses multiprocessing (you set n_jobs=-1). It may start multiple processes, so it has to serialize the parameters in one process and pass them to the others, which then deserialize them to get the same parameters.

GridSearchCV uses joblib for multiprocessing and serialization. The standard pickle module, which joblib relies on here, stores a function as a reference to its name, so it cannot handle lambda functions.
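As a rough illustration of why the name matters, the sketch below pickles a named standard-library function and a partial wrapping it. It uses re.findall purely as an example of an importable function (my choice, not something from the question) and shows that the payload carries the function's name and that unpickling returns the very same function object.

```python
import pickle
import re
from functools import partial

# A named, importable function is pickled by reference: the payload holds
# the module and function names rather than the function's code.
payload = pickle.dumps(re.findall)
print(b"findall" in payload)                # True
print(pickle.loads(payload) is re.findall)  # True

# A partial stores that same reference plus its bound arguments.
tokenizer = partial(re.findall, r"\w+")
partial_payload = pickle.dumps(tokenizer)
clone = pickle.loads(partial_payload)
print(b"findall" in partial_payload)        # True
print(clone.func is re.findall)             # True
print(clone.args == (r"\w+",))              # True
```

A lambda has no importable name for the receiving process to look up, which is what the "attribute lookup ... failed" part of the error refers to.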

phi answered Oct 02 '22 20:10