 

Compare multiple algorithms with sklearn pipeline

I'm trying to set up a scikit-learn pipeline to simplify my work. The problem I'm facing is that I don't know which algorithm (random forest, naive Bayes, decision tree, etc.) fits best, so I need to try each of them and compare the results. However, does a pipeline only take one algorithm at a time? For example, the pipeline below only takes SGDClassifier() as the algorithm.

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

What should I do if I want to compare different algorithms? Can I do something like this?

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
    ('classifier', MultinomialNB()),
])

I don't want to break it down into two pipelines because preprocessing the data is very time-consuming.

Thanks in advance!



2 Answers

Improving on Bruno's answer, what most people really want is to be able to pass in ANY classifier (without hard-coding each one), along with any parameters for each classifier. Here is an easy way to do this:

Create a switcher class that works for any estimator

from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier


class ClfSwitcher(BaseEstimator):

    def __init__(self, estimator=SGDClassifier()):
        """
        A custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - the classifier
        """
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)
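
For instance, the switcher behaves like any other estimator (a minimal usage sketch; MultinomialNB and the X_train/y_train/X_test/y_test arrays are hypothetical):

from sklearn.naive_bayes import MultinomialNB

clf = ClfSwitcher(estimator=MultinomialNB())
clf.fit(X_train, y_train)         # fit delegates to the wrapped estimator
print(clf.score(X_test, y_test))  # and so does score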

Now you can pass in anything for the estimator parameter. And you can optimize any parameter for any estimator you pass in as follows:

Perform hyper-parameter optimization

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher()),
])

parameters = [
    {
        'clf__estimator': [SGDClassifier()], # SVM if hinge loss / logreg if log loss
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': ['english', None],
        'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
        'clf__estimator__max_iter': [50, 80],
        'clf__estimator__tol': [1e-4],
        'clf__estimator__loss': ['hinge', 'log', 'modified_huber'],
    },
    {
        'clf__estimator': [MultinomialNB()],
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': [None],
        'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
    },
]

gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=12, return_train_score=False, verbose=3)
gscv.fit(train_data, train_labels)
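
Once the search has run, the winning combination can be inspected (a quick sketch, assuming gscv was fitted as above):

    print(gscv.best_params_)              # includes the winning 'clf__estimator'
    print(gscv.best_score_)               # mean cross-validated score of the best combo

    best_pipeline = gscv.best_estimator_  # the full pipeline, refit on the training data
    # predictions = best_pipeline.predict(test_data)  # test_data is hypothetical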

How to interpret clf__estimator__loss

clf__estimator__loss is interpreted as the loss parameter of whatever estimator is (estimator = SGDClassifier() in the topmost example), where estimator is itself a parameter of clf, the ClfSwitcher object.
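
The same nested '__' addressing also works outside of GridSearchCV, for example via set_params (a sketch using the pipeline defined above):

    pipeline.set_params(
        clf__estimator=SGDClassifier(),  # swap in the estimator...
        clf__estimator__loss='hinge',    # ...then set one of its own parameters
    )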



Preprocessing

You say that preprocessing the data is very slow, so I assume that you consider the TF-IDF vectorization to be part of your preprocessing.

You could preprocess just once.

X = <your original data>

from sklearn.feature_extraction.text import TfidfVectorizer
X = TfidfVectorizer().fit_transform(X)

Once you have your new transformed data, you can continue using it and choose the best classifier.
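
A minimal sketch of that approach (assuming y holds the matching labels): compare a couple of classifiers on the already-transformed X with cross-validation.

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

for clf in (SGDClassifier(), MultinomialNB()):
    scores = cross_val_score(clf, X, y, cv=5)  # reuses the transformed X each time
    print(type(clf).__name__, scores.mean())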

Optimizing the TF-IDF Transformer

While you could transform your data with TfidfVectorizer just once, I would not recommend it, because the TfidfVectorizer has hyper-parameters itself, which can also be optimized. In the end, you want to optimize the whole Pipeline together, because the best parameters for the TfidfVectorizer in a Pipeline [TfidfVectorizer, SGDClassifier] can be different than for a Pipeline [TfidfVectorizer, MultinomialNB].

Creating a custom classifier

To answer exactly what you asked: you could make your own estimator that has the choice of model as a hyper-parameter.

from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB


class MyClassifier(BaseEstimator):

    def __init__(self, classifier_type: str = 'SGDClassifier'):
        """
        A custom BaseEstimator that can switch between classifiers.
        :param classifier_type: string - The switch for different classifiers
        """
        self.classifier_type = classifier_type

    def fit(self, X, y=None):
        if self.classifier_type == 'SGDClassifier':
            self.classifier_ = SGDClassifier()
        elif self.classifier_type == 'MultinomialNB':
            self.classifier_ = MultinomialNB()
        else:
            raise ValueError('Unknown classifier type.')

        self.classifier_.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.classifier_.predict(X)

You can then use this custom classifier in your Pipeline.

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MyClassifier())
])

You can then use GridSearchCV to choose the best model. When you create the parameter space, you can use a double underscore to specify a hyper-parameter of a step in your pipeline.

parameter_space = {
    'clf__classifier_type': ['SGDClassifier', 'MultinomialNB']
}

from sklearn.model_selection import GridSearchCV

search = GridSearchCV(pipeline , parameter_space, n_jobs=-1, cv=5)
search.fit(X, y)

print('Best model:\n', search.best_params_)
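
Following the point above about optimizing the whole Pipeline together, the parameter space could be widened to tune the vectorizer jointly with the model choice (a sketch; the values are illustrative):

    parameter_space = {
        'clf__classifier_type': ['SGDClassifier', 'MultinomialNB'],
        'tfidf__max_df': [0.5, 0.75, 1.0],
        'tfidf__stop_words': ['english', None],
    }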