I have two questions about using a Pipeline object:
Does the pipeline fit and transform the train data when I call the .fit() method, or should I call .fit_transform() instead? What is the difference between the two?
When I call the .predict() method on the test data, does the pipeline transform the test data first and only then predict on it? Or should I transform the test data myself with the .transform() method before calling .predict()?
This is the code I have:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
# creating some data: 5 random features, so that k=3 and n_components=2 are valid
X, y = np.random.RandomState(0).rand(50, 5), np.hstack(([0] * 45, [1] * 5))
# creating the pipeline
steps = [('scaler', StandardScaler()), ('SelectKBest', SelectKBest(f_classif, k=3)), ('pca', PCA(n_components=2)), ('DT', DecisionTreeClassifier(random_state=0))]
model = Pipeline(steps=steps)
# splitting the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
# fitting the pipeline on the train data
model.fit(X_train, y_train)
# predicting on the test data
model.predict(X_test)
The Pipeline object exposes the methods of its last step. Since your last step is a DecisionTreeClassifier (an estimator), the pipeline does not offer fit_transform(), but it does offer estimator methods such as fit(), predict(), score(), etc.
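To make this concrete, here is a minimal sketch comparing a pipeline that ends in a classifier with one that ends in a transformer. The data and the names X_demo, y_demo, clf_pipe and tr_pipe are made up for illustration; the check uses try/except because how the missing method surfaces may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 30 samples, 4 random features, alternating labels.
rng = np.random.RandomState(0)
X_demo = rng.rand(30, 4)
y_demo = np.array([0, 1] * 15)

# Pipeline whose last step is a classifier: fit_transform() is not usable.
clf_pipe = Pipeline([('scaler', StandardScaler()),
                     ('DT', DecisionTreeClassifier(random_state=0))])
try:
    clf_pipe.fit_transform(X_demo, y_demo)
    clf_has_fit_transform = True
except AttributeError:
    clf_has_fit_transform = False

# Pipeline whose last step is a transformer: fit_transform() works.
tr_pipe = Pipeline([('scaler', StandardScaler()),
                    ('pca', PCA(n_components=2))])
reduced = tr_pipe.fit_transform(X_demo)

print(clf_has_fit_transform)
print(reduced.shape)
```

So fit_transform() only makes sense on a pipeline whose final step is itself a transformer; with a classifier at the end, fit() is the method to call.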
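When you call fit(), the pipeline calls fit_transform() on each transformer in turn, feeding the output of one step into the next, and finally calls fit() on the estimator. So fit() is all you need for the train data.
When you call predict(), the pipeline calls transform() on the data with each of the already fitted transformers and then calls predict() on the estimator. So you should not transform the test data yourself before calling predict().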
[Figure: pipeline workflow for fit() and predict(). Image from Raschka, Sebastian. Python Machine Learning. Birmingham, UK: Packt Publishing, 2015. Print.]
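The behaviour described above can be checked with a small sketch that runs the same steps twice: once through the pipeline, and once by hand. The toy data and the variable names (scaler, selector, pca, tree, pipeline_pred, manual_pred) are assumptions made for the example and are not part of the question's code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 60 samples, 5 random features, binary labels.
rng = np.random.RandomState(0)
X = rng.rand(60, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Route 1: let the pipeline handle everything.
model = Pipeline(steps=[('scaler', StandardScaler()),
                        ('SelectKBest', SelectKBest(f_classif, k=3)),
                        ('pca', PCA(n_components=2)),
                        ('DT', DecisionTreeClassifier(random_state=0))])
model.fit(X_train, y_train)
pipeline_pred = model.predict(X_test)

# Route 2: do by hand what fit() does, i.e. fit_transform() on each
# transformer in order, then fit() on the final estimator.
scaler = StandardScaler()
selector = SelectKBest(f_classif, k=3)
pca = PCA(n_components=2)
tree = DecisionTreeClassifier(random_state=0)

Xt_train = scaler.fit_transform(X_train)
Xt_train = selector.fit_transform(Xt_train, y_train)
Xt_train = pca.fit_transform(Xt_train)
tree.fit(Xt_train, y_train)

# ...and what predict() does: transform() only (no refitting), then predict().
Xt_test = pca.transform(selector.transform(scaler.transform(X_test)))
manual_pred = tree.predict(Xt_test)

print(np.array_equal(pipeline_pred, manual_pred))
```

Note that the test data only ever goes through transform(), never fit_transform(), so the scaling, feature selection and PCA learned on the train data are reused unchanged.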
