I'm trying to set up a scikit-learn pipeline to simplify my work. The problem I'm facing is that I don't know which algorithm (random forest, naive Bayes, decision tree, etc.) fits best, so I need to try each of them and compare the results. However, does a pipeline only take one algorithm at a time? For example, the pipeline below only takes SGDClassifier() as the algorithm.
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
What should I do if I want to compare different algorithms? Can I do something like this?
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
    ('classifier', MultinomialNB()),
])
I don't want to break it down into two pipelines because preprocessing the data is super time-consuming.
Thanks in advance!
Pipelines have several key benefits: they make your workflow much easier to read and understand, and they enforce the implementation and order of the steps in your project. These in turn make your work much more reproducible. The purpose of a pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by '__', as in the examples below. The related helper make_pipeline differs only in that it generates the step names automatically; a minimal sketch of both follows.
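A minimal sketch of the naming difference and the '__' syntax, assuming the imports shown:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline, make_pipeline

# Pipeline: step names are chosen explicitly by you.
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# make_pipeline: step names are derived from the lower-cased class names.
auto_pipe = make_pipeline(CountVectorizer(), TfidfTransformer(), SGDClassifier())
print(list(auto_pipe.named_steps))  # ['countvectorizer', 'tfidftransformer', 'sgdclassifier']

# The '__' syntax addresses a step's parameter through its step name.
pipe.set_params(clf__alpha=1e-4)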
Improving on Bruno's answer: what most people really want is to be able to pass in ANY classifier (without having to hard-code each one), along with any parameters for each classifier. Here is an easy way to do this:
from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier


class ClfSwitcher(BaseEstimator):

    def __init__(self, estimator=SGDClassifier()):
        """
        A custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - The classifier
        """
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        # Delegate fitting to whichever estimator was passed in.
        self.estimator.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)
Now you can pass in anything for the estimator parameter. And you can optimize any parameter for any estimator you pass in as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher()),
])
parameters = [
    {
        'clf__estimator': [SGDClassifier()],  # SVM if hinge loss / logreg if log loss
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': ['english', None],
        'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
        'clf__estimator__max_iter': [50, 80],
        'clf__estimator__tol': [1e-4],
        # note: 'log' is spelled 'log_loss' in scikit-learn >= 1.1
        'clf__estimator__loss': ['hinge', 'log', 'modified_huber'],
    },
    {
        'clf__estimator': [MultinomialNB()],
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': [None],
        'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
    },
]
gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=12, return_train_score=False, verbose=3)
gscv.fit(train_data, train_labels)
Here, clf__estimator__loss is interpreted as the loss parameter for whatever estimator is, where estimator = SGDClassifier() in the topmost example; estimator is itself a parameter of clf, which is a ClfSwitcher object.
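Once the search has finished, you can check which estimator and settings won. A minimal sketch, assuming the fitted gscv from above:

# gscv.best_estimator_ is the refit pipeline (refit=True is the default).
print(gscv.best_params_)  # includes the winning 'clf__estimator'
best_clf = gscv.best_estimator_.named_steps['clf'].estimator
print(type(best_clf).__name__)  # e.g. SGDClassifier or MultinomialNB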
You say that preprocessing the data is very slow, so I assume that you consider TF-IDF vectorization to be part of your preprocessing.
You could preprocess just once.
from sklearn.feature_extraction.text import TfidfVectorizer

X = <your original data>
X = TfidfVectorizer().fit_transform(X)
Once you have your new transformed data, you can continue using it and choose the best classifier.
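For example, you could score each candidate on the pre-transformed matrix with cross-validation. A minimal sketch (the candidate list and y are placeholders for your own models and labels):

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# y = <your labels>; X is the TF-IDF matrix from above.
# Note: the vectorizer was fit on all of X, so these CV scores are slightly optimistic.
for clf in (SGDClassifier(), MultinomialNB()):
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, scores.mean())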
While you could transform your data with TfidfVectorizer just once, I would not recommend it, because TfidfVectorizer has hyper-parameters itself, which can also be optimized. In the end, you want to optimize the whole Pipeline together, because the parameters for the TfidfVectorizer in a Pipeline [TfidfVectorizer, SGDClassifier] can be different than for a Pipeline [TfidfVectorizer, MultinomialNB].
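If refitting the vectorizer on every grid-search candidate is the real cost, note that Pipeline also accepts a memory argument that caches fitted transformers, so candidates sharing the same vectorizer parameters reuse the cached fit. A minimal sketch (the cache directory name is arbitrary):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline(
    [('tfidf', TfidfVectorizer()), ('clf', SGDClassifier())],
    memory='cache_dir',  # fitted transformers are cached and reused across candidates
)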
To answer exactly what you asked: you could make your own estimator that has the choice of model as a hyper-parameter.
from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB


class MyClassifier(BaseEstimator):

    def __init__(self, classifier_type: str = 'SGDClassifier'):
        """
        A custom BaseEstimator that can switch between classifiers.
        :param classifier_type: string - The switch for different classifiers
        """
        self.classifier_type = classifier_type

    def fit(self, X, y=None):
        # Instantiate the chosen classifier at fit time.
        if self.classifier_type == 'SGDClassifier':
            self.classifier_ = SGDClassifier()
        elif self.classifier_type == 'MultinomialNB':
            self.classifier_ = MultinomialNB()
        else:
            raise ValueError('Unknown classifier type.')
        self.classifier_.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.classifier_.predict(X)
You can then use this custom classifier in your Pipeline.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MyClassifier()),
])
You can then use GridSearchCV to choose the best model. When you create the parameter space, you can use a double underscore to specify a hyper-parameter of a step in your pipeline.
from sklearn.model_selection import GridSearchCV

parameter_space = {
    'clf__classifier_type': ['SGDClassifier', 'MultinomialNB'],
}

search = GridSearchCV(pipeline, parameter_space, n_jobs=-1, cv=5)
search.fit(X, y)
print('Best model:\n', search.best_params_)
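Since, as argued above, the vectorizer's hyper-parameters are worth searching jointly with the model choice, the same space can be extended with tfidf__ entries; a hypothetical sketch (the max_df values are just an illustration):

parameter_space = {
    'clf__classifier_type': ['SGDClassifier', 'MultinomialNB'],
    'tfidf__max_df': [0.5, 0.75, 1.0],  # hypothetical values, tune to your data
}

One limitation of this string-switch design is that the inner classifier is only created inside fit, so its own hyper-parameters (e.g. the alpha of SGDClassifier) cannot be reached with the '__' syntax; the ClfSwitcher approach above does allow that.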