
What is the difference between pipeline and make_pipeline in scikit?

I got this from the sklearn webpage:

  • Pipeline: Pipeline of transforms with a final estimator

  • Make_pipeline: Construct a Pipeline from the given estimators. This is a shorthand for the Pipeline constructor.

But I still do not understand when I have to use each one. Can anyone give me an example?

Asked by Aizzaac on Nov 20 '16


People also ask

What is make_pipeline in sklearn?

This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.
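For example, here is a minimal sketch of that naming behaviour (the StandardScaler/LogisticRegression combination is just an arbitrary illustration, not part of the quoted answer):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(StandardScaler(), LogisticRegression())
# The step names are the lowercased class names, generated automatically:
print(list(pipe.named_steps))  # ['standardscaler', 'logisticregression']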

What are two advantages of using Sklearn pipelines?

They have several key benefits: They make your workflow much easier to read and understand. They enforce the implementation and order of steps in your project. These in turn make your work much more reproducible.

What is the benefit of using the scikit-learn pipeline utility for data preprocessing?

The Scikit-learn pipeline is a tool that chains all steps of the workflow together for a more streamlined procedure. The key benefit of building a pipeline is improved readability. Pipelines are able to execute a series of transformations with one call, allowing users to attain results with less code.
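As a rough illustration of the "one call" idea (a sketch with a toy dataset; the scaler and classifier here are arbitrary example choices):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('scale', StandardScaler()), ('svm', SVC())])
# fit() runs each transform in order on the training data, then fits the final estimator:
pipe.fit(X_train, y_train)
# score()/predict() re-apply the fitted transforms before the classifier runs:
print(pipe.score(X_test, y_test))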


1 Answer

The only difference is that make_pipeline generates names for steps automatically.

Step names are needed e.g. if you want to use a pipeline with model selection utilities (e.g. GridSearchCV). With grid search you need to specify parameters for various steps of a pipeline:

pipe = Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
param_grid = [{'clf__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)

Compare it with make_pipeline:

pipe = make_pipeline(CountVectorizer(), LogisticRegression())
param_grid = [{'logisticregression__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)

So, with Pipeline:

  • names are explicit; you don't have to figure them out if you need them;
  • the name doesn't change if you change the estimator/transformer used in a step, e.g. if you replace LogisticRegression() with LinearSVC() you can still use clf__C (see the sketch after this list).
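
A small sketch of that second point (same grid as above; LinearSVC is just an example substitute, and it also exposes a C parameter):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# The step is still named 'clf', so the 'clf__C' grid key keeps working:
pipe = Pipeline([('vec', CountVectorizer()), ('clf', LinearSVC())])
param_grid = [{'clf__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)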

With make_pipeline:

  • shorter and arguably more readable notation;
  • names are auto-generated using a straightforward rule (the lowercased class name of the estimator; see the sketch after this list for how duplicates are handled).
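
One related detail (based on current scikit-learn behaviour as I understand it, not stated in the quoted answer): if two steps share a class, make_pipeline disambiguates the generated names with numeric suffixes.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe = make_pipeline(StandardScaler(), PCA(), PCA())
# Duplicate classes get '-1', '-2', ... appended to the lowercased name:
print(list(pipe.named_steps))  # ['standardscaler', 'pca-1', 'pca-2']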

When to use them is up to you :) I prefer make_pipeline for quick experiments and Pipeline for more stable code; a rule of thumb: IPython Notebook -> make_pipeline; Python module in a larger project -> Pipeline. But it is certainly not a big deal to use make_pipeline in a module or Pipeline in a short script or a notebook.

Answered by Mikhail Korobov on Oct 08 '22