I'm trying to set up a scikit-learn pipeline to simplify my work. The problem I'm facing is that I don't know which algorithm (random forest, naive Bayes, decision tree, etc.) fits best, so I need to try each of them and compare the results. However, does a pipeline only take one algorithm at a time? For example, the pipeline below only takes SGDClassifier() as the algorithm.
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
What should I do if I want to compare different algorithms? Can I do something like this?
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
    ('classifier', MultinomialNB()),
])
I don't want to break it down into two pipelines because preprocessing the data is super time-consuming.
Thanks in advance!
Pipelines have several key benefits: they make your workflow much easier to read and understand, and they enforce the implementation and order of the steps in your project. These in turn make your work much more reproducible. The purpose of a pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by '__', as in the examples below. The related helper make_pipeline differs only in that it generates the step names automatically; a minimal sketch of both follows.
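A minimal sketch of the naming difference and the '__' syntax, assuming the imports shown:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline, make_pipeline

# Pipeline: step names are chosen explicitly by you.
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# make_pipeline: step names are derived from the lower-cased class names.
auto_pipe = make_pipeline(CountVectorizer(), TfidfTransformer(), SGDClassifier())
print(list(auto_pipe.named_steps))  # ['countvectorizer', 'tfidftransformer', 'sgdclassifier']

# The '__' syntax addresses a step's parameter through its step name.
pipe.set_params(clf__alpha=1e-4)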
Improving on Bruno's answer: what most people really want is to be able to pass in ANY classifier (without having to hard-code each one), along with any parameters for each classifier. Here is an easy way to do this:
from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier


class ClfSwitcher(BaseEstimator):

    def __init__(self, estimator=SGDClassifier()):
        """
        A custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - The classifier
        """
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        # Delegate fitting to whichever estimator was passed in.
        self.estimator.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)
Now you can pass in anything for the estimator parameter. And you can optimize any parameter for any estimator you pass in as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher()),
])
parameters = [
    {
        'clf__estimator': [SGDClassifier()],  # SVM if hinge loss / logreg if log loss
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': ['english', None],
        'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
        'clf__estimator__max_iter': [50, 80],
        'clf__estimator__tol': [1e-4],
        # note: 'log' is spelled 'log_loss' in scikit-learn >= 1.1
        'clf__estimator__loss': ['hinge', 'log', 'modified_huber'],
    },
    {
        'clf__estimator': [MultinomialNB()],
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': [None],
        'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
    },
]
gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=12, return_train_score=False, verbose=3)
gscv.fit(train_data, train_labels)
Here, clf__estimator__loss is interpreted as the loss parameter for whatever estimator is, where estimator = SGDClassifier() in the topmost example; estimator is itself a parameter of clf, which is a ClfSwitcher object.
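Once the search has finished, you can check which estimator and settings won. A minimal sketch, assuming the fitted gscv from above:

# gscv.best_estimator_ is the refit pipeline (refit=True is the default).
print(gscv.best_params_)  # includes the winning 'clf__estimator'
best_clf = gscv.best_estimator_.named_steps['clf'].estimator
print(type(best_clf).__name__)  # e.g. SGDClassifier or MultinomialNB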
You say that preprocessing the data is very slow, so I assume that you consider TF-IDF vectorization to be part of your preprocessing.
You could preprocess just once.
from sklearn.feature_extraction.text import TfidfVectorizer

X = <your original data>
X = TfidfVectorizer().fit_transform(X)
Once you have your new transformed data, you can continue using it and choose the best classifier.
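For example, you could score each candidate on the pre-transformed matrix with cross-validation. A minimal sketch (the candidate list and y are placeholders for your own models and labels):

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# y = <your labels>; X is the TF-IDF matrix from above.
# Note: the vectorizer was fit on all of X, so these CV scores are slightly optimistic.
for clf in (SGDClassifier(), MultinomialNB()):
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, scores.mean())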
While you could transform your data with TfidfVectorizer just once, I would not recommend it, because TfidfVectorizer has hyper-parameters itself, which can also be optimized. In the end, you want to optimize the whole Pipeline together, because the parameters for the TfidfVectorizer in a Pipeline [TfidfVectorizer, SGDClassifier] can be different than for a Pipeline [TfidfVectorizer, MultinomialNB].
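If refitting the vectorizer on every grid-search candidate is the real cost, note that Pipeline also accepts a memory argument that caches fitted transformers, so candidates sharing the same vectorizer parameters reuse the cached fit. A minimal sketch (the cache directory name is arbitrary):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline(
    [('tfidf', TfidfVectorizer()), ('clf', SGDClassifier())],
    memory='cache_dir',  # fitted transformers are cached and reused across candidates
)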
To answer exactly what you asked: you could make your own estimator that has the choice of model as a hyper-parameter.
from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB


class MyClassifier(BaseEstimator):

    def __init__(self, classifier_type: str = 'SGDClassifier'):
        """
        A custom BaseEstimator that can switch between classifiers.
        :param classifier_type: string - The switch for different classifiers
        """
        self.classifier_type = classifier_type

    def fit(self, X, y=None):
        # Instantiate the chosen classifier at fit time.
        if self.classifier_type == 'SGDClassifier':
            self.classifier_ = SGDClassifier()
        elif self.classifier_type == 'MultinomialNB':
            self.classifier_ = MultinomialNB()
        else:
            raise ValueError('Unknown classifier type.')
        self.classifier_.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.classifier_.predict(X)
You can then use this custom classifier in your Pipeline.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MyClassifier()),
])
You can then use GridSearchCV to choose the best model. When you create the parameter space, you can use a double underscore to specify a hyper-parameter of a step in your pipeline.
from sklearn.model_selection import GridSearchCV

parameter_space = {
    'clf__classifier_type': ['SGDClassifier', 'MultinomialNB'],
}

search = GridSearchCV(pipeline, parameter_space, n_jobs=-1, cv=5)
search.fit(X, y)
print('Best model:\n', search.best_params_)
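Since, as argued above, the vectorizer's hyper-parameters are worth searching jointly with the model choice, the same space can be extended with tfidf__ entries; a hypothetical sketch (the max_df values are just an illustration):

parameter_space = {
    'clf__classifier_type': ['SGDClassifier', 'MultinomialNB'],
    'tfidf__max_df': [0.5, 0.75, 1.0],  # hypothetical values, tune to your data
}

One limitation of this string-switch design is that the inner classifier is only created inside fit, so its own hyper-parameters (e.g. the alpha of SGDClassifier) cannot be reached with the '__' syntax; the ClfSwitcher approach above does allow that.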