The fit() function calculates the values of these parameters. The transform() function applies those parameter values to the actual data and returns the normalized values. The fit_transform() function performs both in a single step. Note that the result is the same whether we do it in two steps or in one.
For a predictor such as Linear Regression: fit() - Calculates the parameters/weights on the training data (e.g. the coefficients exposed through the coef_ attribute in the case of Linear Regression) and saves them as the object's internal state. predict() - Uses the weights calculated above on test data to make predictions. transform() - Cannot be used. fit_transform() - Cannot be used.
We use fit_transform() on the train data so that we learn the scaling parameters from the train data and, at the same time, scale the train data. We only use transform() on the test data because we reuse the scaling parameters learned on the train data to scale the test data.
[...] a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.
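To make the mechanics concrete, here is a minimal sketch of what a scaler does internally. TinyStandardScaler is a hypothetical, single-feature class written only for illustration; it is not scikit-learn's implementation, and it assumes the population standard deviation is used for scaling.

```python
from statistics import fmean, pstdev


class TinyStandardScaler:
    """Hypothetical single-feature stand-in for a scikit-learn style scaler."""

    def fit(self, values):
        # Learn the parameters and keep them as internal state.
        self.mean_ = fmean(values)
        self.scale_ = pstdev(values)  # population standard deviation (assumption)
        return self  # returning self allows fit(...).transform(...)

    def transform(self, values):
        # Apply the parameters learned by fit(); nothing is re-estimated here.
        return [(v - self.mean_) / self.scale_ for v in values]

    def fit_transform(self, values):
        # Both steps at once, on the same data.
        return self.fit(values).transform(values)


train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

# Two steps and one step give the same result on the training data.
two_steps = TinyStandardScaler().fit(train).transform(train)
one_step = TinyStandardScaler().fit_transform(train)
print(two_steps == one_step)

# The test data is scaled with the mean and scale learned from train.
scaler = TinyStandardScaler().fit(train)
print(scaler.transform(test))
```

The point of the design is that transform() never looks at the statistics of the data it receives; it only uses whatever fit() stored.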
In the scikit-learn estimator API:
fit(): used for learning model parameters from the training data.
transform(): applies the parameters learned by fit() to a data set to generate the transformed data set.
fit_transform(): a combination of fit() and transform() on the same data set.
Check out Chapter 4 of this book and the answer from stackexchange for more clarity.
These methods are used to center and scale the features of a given data set. This basically normalizes the data so that every feature is on a common scale.
For this, we use the Z-score method.
We compute it on the training set of data.
1. fit(): calculates the parameters μ and σ and saves them as internal state of the object.
2. transform(): uses these calculated parameters to apply the transformation to a particular data set.
3. fit_transform(): joins the fit() and transform() methods to fit and transform the same data set.
Code snippet for feature scaling/standardisation (after train_test_split):
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # learn μ and σ from the training set, then scale it
X_test = sc.transform(X_test)  # scale the test set with the same μ and σ
We apply the same transformation to our test set, using the same two parameters (μ and σ) that were computed from the training set.
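If you want to check this yourself, the sketch below inspects the parameters that StandardScaler stores after fitting (mean_ and scale_) and reproduces the test-set transformation by hand. The small arrays are made-up illustration data, not from the question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up illustration data: two features, three training rows, one test row.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean_ and scale_ from X_train
X_test_scaled = sc.transform(X_test)        # reuses them; nothing is re-learned

print(sc.mean_)   # per-column mean of X_train
print(sc.scale_)  # per-column standard deviation of X_train

# The test transformation is just (x - mean_train) / scale_train.
manual = (X_test - sc.mean_) / sc.scale_
print(np.allclose(X_test_scaled, manual))
```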
The .transform method is meant for when you have already computed PCA, i.e. if you have already called its .fit method.
In [12]: pc2 = RandomizedPCA(n_components=3)
In [13]: pc2.transform(X) # can't transform because it does not know how to do it.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-13-e3b6b8ea2aff> in <module>()
----> 1 pc2.transform(X)
/usr/local/lib/python3.4/dist-packages/sklearn/decomposition/pca.py in transform(self, X, y)
714 # XXX remove scipy.sparse support here in 0.16
715 X = atleast2d_or_csr(X)
--> 716 if self.mean_ is not None:
717 X = X - self.mean_
718
AttributeError: 'RandomizedPCA' object has no attribute 'mean_'
In [14]: pc2.ftransform(X)
pc2.fit pc2.fit_transform
In [14]: pc2.fit_transform(X)
Out[14]:
array([[-1.38340578, -0.2935787 ],
[-2.22189802, 0.25133484],
[-3.6053038 , -0.04224385],
[ 1.38340578, 0.2935787 ],
[ 2.22189802, -0.25133484],
[ 3.6053038 , 0.04224385]])
So you want to fit RandomizedPCA and then transform as:
In [20]: pca = RandomizedPCA(n_components=3)
In [21]: pca.fit(X)
Out[21]:
RandomizedPCA(copy=True, iterated_power=3, n_components=3, random_state=None,
whiten=False)
In [22]: pca.transform(z)
Out[22]:
array([[ 2.76681156, 0.58715739],
[ 1.92831932, 1.13207093],
[ 0.54491354, 0.83849224],
[ 5.53362311, 1.17431479],
[ 6.37211535, 0.62940125],
[ 7.75552113, 0.92297994]])
In [23]:
In particular, PCA .transform applies the change of basis obtained through the PCA decomposition of the matrix X to the matrix Z.
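For readers on a recent scikit-learn: RandomizedPCA was removed in later releases, and PCA with svd_solver="randomized" is the replacement. The sketch below repeats the same fit-then-transform pattern with that class, on random illustration data (the arrays and seeds are assumptions). In recent versions, calling transform before fit raises NotFittedError rather than the AttributeError shown in the session above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.exceptions import NotFittedError

# Random illustration data: X is used for fitting, Z is projected afterwards.
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 3))
Z = rng.normal(size=(5, 3))

pca = PCA(n_components=2, svd_solver="randomized", random_state=0)

# transform() before fit() cannot work: no basis has been learned yet.
try:
    pca.transform(X)
except NotFittedError as exc:
    print("transform before fit failed:", type(exc).__name__)

pca.fit(X)                      # learn the mean and the principal axes from X
Z_projected = pca.transform(Z)  # apply that change of basis to Z
print(Z_projected.shape)
```

With whiten=False (the default), the projection is equivalent to centering Z with the mean learned from X and multiplying by the learned components.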
In layman's terms, fit_transform means doing some calculation and then doing the transformation (say, calculating the column means from some data and then replacing the missing values with them). So for the training set, you need to both calculate and transform.
But for the test set, machine learning makes predictions based on what was learned from the training set, so nothing needs to be calculated again; it just performs the transformation.
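The missing-value example can be sketched with scikit-learn's SimpleImputer. The arrays are made-up illustration data; the column means are calculated on the training set only and then reused to fill the gaps in the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up illustration data with missing values (np.nan).
X_train = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
X_test = np.array([[np.nan, np.nan], [7.0, 8.0]])

imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)  # calculate column means, then fill
X_test_filled = imputer.transform(X_test)        # only fill, using the train means

print(imputer.statistics_)  # column means learned from X_train
print(X_test_filled)
```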
Why and when to use each one of fit(), transform(), fit_transform()
Usually we have a supervised learning problem with (X, y) as our dataset, and we split it into training data and test data:
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_vectorized = model.fit_transform(X_train)
X_test_vectorized = model.transform(X_test)
Imagine we are fitting a tokenizer: if we fit on X, we are including the test data in the tokenizer. I have seen this error many times!
The correct approach is to fit ONLY on X_train, because you don't know "your future data", so you cannot use X_test for fitting anything!
Then you can transform your test data, but separately; that's why there are different methods.
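As an illustration of the point above, here is a small sketch with CountVectorizer standing in for the "model". The toy sentences are made up. The vocabulary is learned from the training texts only, so a word that appears only in the test text is simply ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus.
train_texts = ["the cat sat", "the dog barked"]
test_texts = ["the zebra sat"]

vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(train_texts)  # builds the vocabulary from train only
X_test_vec = vectorizer.transform(test_texts)        # reuses that vocabulary

print(sorted(vectorizer.vocabulary_))    # words known to the vectorizer
print("zebra" in vectorizer.vocabulary_)  # the test-only word was never learned
print(X_test_vec.toarray())
```

If the vectorizer had been fitted on all the texts instead, "zebra" would have leaked into the vocabulary before the model ever saw the test set.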
Final tip: X_train_transformed = model.fit_transform(X_train) is equivalent to X_train_transformed = model.fit(X_train).transform(X_train), but the first one can be faster.
Note that what I call "model" will usually be a scaler, a tfidf transformer, another kind of vectorizer, a tokenizer...
Remember: X represents the features and y represents the label of each sample. X is (usually) a dataframe and y is a pandas Series object.