
What exactly is sklearn.pipeline.Pipeline?

I can't figure out exactly how sklearn.pipeline.Pipeline works.

There are a few explanations in the docs. For example, what do they mean by:

Pipeline of transforms with a final estimator.

To make my question clearer, what are steps? How do they work?

Edit

Thanks to the answers I can make my question clearer:

When I create a Pipeline and pass, as steps, two transformers and one estimator, e.g.:

    pipln = Pipeline([("trsfm1", transformer_1),
                      ("trsfm2", transformer_2),
                      ("estmtr", estimator)])

What happens when I call this?

pipln.fit() OR pipln.fit_transform() 

I can't figure out how an estimator can be a transformer and how a transformer can be fitted.

farhawa asked Oct 12 '15 22:10



2 Answers

A transformer in scikit-learn is a class that has fit and transform methods, or a fit_transform method.

A predictor is a class that has fit and predict methods, or a fit_predict method.
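To make the two definitions concrete, here is a minimal sketch. MeanCenterer is a hypothetical toy class written only for illustration (it is not part of scikit-learn), the data is made up, and LogisticRegression simply stands in for any predictor.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the per-column mean learned in fit()."""

    def fit(self, X, y=None):
        # Learn the statistic from the data passed to fit()
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Apply the learned statistic to any data
        return np.asarray(X, dtype=float) - self.mean_


# Made-up toy data
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([0, 0, 1, 1])

# Transformer: fit + transform (fit_transform is supplied by TransformerMixin)
centerer = MeanCenterer()
X_centered = centerer.fit_transform(X)

# Predictor: fit + predict
clf = LogisticRegression()
y_pred = clf.fit(X_centered, y).predict(X_centered)
print(X_centered.mean(axis=0), y_pred.shape)
```

Anything that follows this fit/transform or fit/predict convention can be used as a pipeline step.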

Pipeline is just an abstract notion; it's not some existing ML algorithm. Often in ML tasks you need to perform a sequence of different transformations of the raw dataset (find a set of features, generate new features, select only some good features) before applying the final estimator.

Here is a good example of Pipeline usage. Pipeline gives you a single interface for all 3 steps: the two transformations and the resulting estimator. It encapsulates transformers and predictors inside, so you can replace something like:

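    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import SGDClassifier

    # Xtrain, ytrain and Xtest are assumed to exist already (ytrain holds the training labels)
    vect = CountVectorizer()
    tfidf = TfidfTransformer()
    clf = SGDClassifier()

    vX = vect.fit_transform(Xtrain)
    tfidfX = tfidf.fit_transform(vX)
    predicted = clf.fit(tfidfX, ytrain).predict(tfidfX)

    # Now evaluate all steps on the test set: only transform/predict, never re-fit
    vX = vect.transform(Xtest)
    tfidfX = tfidf.transform(vX)
    predicted = clf.predict(tfidfX)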

With just:

    from sklearn.pipeline import Pipeline

    pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', SGDClassifier()),
    ])
    # ytrain is assumed to hold the training labels
    predicted = pipeline.fit(Xtrain, ytrain).predict(Xtrain)

    # Now evaluate all steps on the test set
    predicted = pipeline.predict(Xtest)

With pipelines you can easily perform a grid search over a set of parameters for each step of this meta-estimator, as described in the link above. All steps except the last one must be transformers; the last step can be a transformer or a predictor.

Answer to edit: When you call pipln.fit(), each transformer inside the pipeline is fitted on the output of the previous transformer (the first transformer is fitted on the raw dataset), and the final estimator is then fitted on the output of the last transformer. The last estimator may be a transformer or a predictor. You can call fit_transform() on the pipeline only if the last estimator is a transformer (one that implements fit_transform, or transform and fit separately), and you can call fit_predict() or predict() on the pipeline only if the last estimator is a predictor. So you can't call fit_transform or transform on a pipeline whose last step is a predictor.
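As a rough illustration of the answer to the edit, the sketch below fits a pipeline shaped like the one in the question and then repeats the same work by hand. StandardScaler, PCA and LogisticRegression are stand-ins chosen here for transformer_1, transformer_2 and estimator, and the data is random. It is meant to show the order of calls, not scikit-learn's actual internals.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random toy data (assumption: any numeric X with labels y would do)
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipln = Pipeline([
    ("trsfm1", StandardScaler()),
    ("trsfm2", PCA(n_components=3, random_state=0)),
    ("estmtr", LogisticRegression()),
])
pipln.fit(X, y)
pipe_pred = pipln.predict(X)

# The same sequence written out by hand
scaler = StandardScaler()
pca = PCA(n_components=3, random_state=0)
clf = LogisticRegression()
Xt = scaler.fit_transform(X)   # first transformer is fitted on the raw data
Xt = pca.fit_transform(Xt)     # second transformer is fitted on the first one's output
clf.fit(Xt, y)                 # final estimator is fitted on the transformed data
# predict() only transforms with the already-fitted steps, then predicts
manual_pred = clf.predict(pca.transform(scaler.transform(X)))

print(np.array_equal(pipe_pred, manual_pred))
# The last step is a predictor, so the pipeline offers predict but not transform
print(hasattr(pipln, "predict"), hasattr(pipln, "transform"))
```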

Ibraim Ganiev answered Oct 17 '22 12:10


I think that M0rkHaV has the right idea. Scikit-learn's Pipeline class is a useful tool for encapsulating multiple different transformers alongside an estimator into one object, so that you only have to call your important methods once (fit(), predict(), etc.). Let's break down the two major components:

  1. Transformers are classes that implement both fit() and transform(). You might be familiar with some of the sklearn preprocessing tools, like TfidfVectorizer and Binarizer. If you look at the docs for these preprocessing tools, you'll see that they implement both of these methods. What I find pretty cool is that some estimators can also be used as transformation steps, e.g. LinearSVC! (In recent scikit-learn releases this requires wrapping the estimator in SelectFromModel, since estimators no longer expose transform() themselves.)

  2. Estimators are classes that implement both fit() and predict(). You'll find that many of the classifiers and regression models implement both of these methods, and as such you can readily test many different models. It is also possible to use another transformer as the final estimator (i.e., it doesn't necessarily implement predict(), but it definitely implements fit()). All this means is that you wouldn't be able to call predict() on the pipeline.
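A small sketch of the second point, a pipeline whose final step is itself a transformer. The step names, the choice of StandardScaler and PCA, and the random data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random toy data
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 4))

# Both steps are transformers; there is no predictor at the end
feature_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=2)),
])

# fit_transform() works because the last step implements transform()
X_reduced = feature_pipe.fit_transform(X)
print(X_reduced.shape)

# predict() is not available because the last step has no predict()
print(hasattr(feature_pipe, "predict"))
```

A pipeline like this can itself be used as a single transformer step inside a larger pipeline.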

As for your edit: let's go through a text-based example. Using LabelBinarizer, we want to turn a list of labels into a list of binary values.

    from sklearn.preprocessing import LabelBinarizer

    bin = LabelBinarizer()  # first we initialize

    vec = ['cat', 'dog', 'dog', 'dog']  # the label list we want binarized

Now, when the binarizer is fitted on some data, it will have an attribute called classes_ that contains the unique classes that the transformer 'knows' about. Without calling fit() the binarizer has no idea what the data looks like, so calling transform() wouldn't make any sense. You can see this if you try to print the list of classes before fitting the data.

    print(bin.classes_)

I get the following error when trying this:

AttributeError: 'LabelBinarizer' object has no attribute 'classes_' 

But when you fit the binarizer on the vec list:

bin.fit(vec) 

and try again

    print(bin.classes_)

I get the following:

    ['cat' 'dog']

    print(bin.transform(vec))

And now, after calling transform on the vec object, we get the following:

    [[0]
     [1]
     [1]
     [1]]

As for estimators being used as transformers, let us use the DecisionTreeClassifier as an example of a feature extractor. Decision trees are great for a lot of reasons, but for our purposes, what's important is that they can rank the features that the tree found useful for predicting. When you call transform() on a decision tree, it will take your input data and keep what it thinks are the most important features. So you can think of it as transforming your data matrix (n rows by m columns) into a smaller matrix (n rows by k columns), where the k columns are the k most important features that the decision tree found. Note that recent scikit-learn releases no longer put transform() on the tree itself; the same selection is done by wrapping the tree in SelectFromModel.
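In current scikit-learn the tree has no transform() method of its own, so a sketch of this idea has to go through SelectFromModel. The example below assumes the iris dataset and k = 2; setting threshold=-np.inf together with max_features is one way to keep exactly the k highest-ranked features.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # n rows by m = 4 columns

# Wrap the tree so it can act as a transformer that keeps the k = 2
# features with the highest feature_importances_
selector = SelectFromModel(
    DecisionTreeClassifier(random_state=0),
    threshold=-np.inf,   # disable the importance threshold
    max_features=2,      # select purely by rank
)
X_small = selector.fit_transform(X, y)

print(X.shape, "->", X_small.shape)
print(selector.get_support())   # boolean mask of the columns that were kept
```

Because the wrapped selector implements fit() and transform(), it can be placed in a Pipeline like any other transformer step.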

rabbit answered Oct 17 '22 12:10