I am having trouble understanding how exactly transform() and fit_transform() work together. I call fit_transform() on my training data set and transform() on my test set afterwards. However, if I call fit_transform() on the test set, I get bad results. Can anybody explain how and why this occurs?
The fit(data) method computes the mean and standard deviation of each feature, which are later used for scaling. The transform(data) method performs the scaling using the mean and standard deviation computed by fit(). The fit_transform() method does both: it fits and then transforms the same data.
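As a minimal sketch of the three methods, assuming StandardScaler as the scaler and a tiny made-up array (X_small is a hypothetical name, not from the question), the following checks that fit() followed by transform() gives the same result as fit_transform():

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 3 samples, 2 features
X_small = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Two steps: learn the parameters, then apply them
scaler = StandardScaler()
scaler.fit(X_small)
two_step = scaler.transform(X_small)

# One step: fit and transform the same data
one_step = StandardScaler().fit_transform(X_small)

print(scaler.mean_)   # per-feature mean learned by fit()
print(scaler.scale_)  # per-feature standard deviation learned by fit()
print(np.allclose(two_step, one_step))
```

The learned parameters live on the fitted scaler object (mean_ and scale_), which is why the same object can later be reused on other data.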
fit_transform() is used on the training data so that we scale the training data and also learn the scaling parameters from it. Here it is the scaler, not the predictive model, that learns the mean and variance of each feature of the training set. These learned parameters are then used to scale the test data.
Remember: fit_transform() is called only on the training data, transform() is called on the test data, and predict() is then called on the transformed test data.
The reason is that we want to pretend that the test data is “new, unseen data.” We use the test dataset to get a good estimate of how our model performs on any new data. Now, in a real application, the new, unseen data could be just 1 data point that we want to classify.
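The single data point case can be illustrated with a small sketch. It assumes StandardScaler and made-up values (X_train_demo and new_point are hypothetical names). Fitting a scaler on one sample alone cannot work sensibly, because the mean of one point is the point itself, so every feature collapses to zero:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data and one new, unseen sample
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
new_point = np.array([[4.0, 40.0]])

# Scale the new point with parameters learned from the training data
scaler = StandardScaler().fit(X_train_demo)
scaled_with_train = scaler.transform(new_point)

# Fitting on the single point alone: its mean is the point itself
scaled_alone = StandardScaler().fit_transform(new_point)

print(scaled_with_train)  # position of the point relative to the training data
print(scaled_alone)       # all zeros, no information left
```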
Let's take an example of a transform, sklearn.preprocessing.StandardScaler.
From the docs, this will:
Standardize features by removing the mean and scaling to unit variance
Suppose you're working with code like the following.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# X is features, y is label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
When you call fit(X_train) on a StandardScaler instance, it calculates the mean and variance of each feature from the values in X_train. Calling .transform() then transforms each feature by subtracting the mean and dividing by the standard deviation (the square root of the variance). For convenience, these two calls can be done in one step using fit_transform().
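To make the arithmetic concrete, here is a small sketch that reproduces the scaler's output by hand with NumPy. The array X_demo is made up for illustration; the check relies on StandardScaler using the population standard deviation (ddof=0), which is also NumPy's default:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data with two features on very different scales
X_demo = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])

scaler = StandardScaler().fit(X_demo)

# Manual standardization: subtract the mean, divide by the standard deviation
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)  # ddof=0, matching StandardScaler
manual = (X_demo - mean) / std

# Dividing by the variance instead gives a different result
by_variance = (X_demo - mean) / X_demo.var(axis=0)

print(np.allclose(manual, scaler.transform(X_demo)))
print(np.allclose(by_variance, scaler.transform(X_demo)))
```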
The reason you want to fit the scaler using only the training data is that you don't want to bias your model with information from the test data.
If you call fit() on your test data, you compute a new mean and variance for each feature. In theory these values may be very similar if your test and train sets have the same distribution, but in practice this is typically not the case. The model was trained on features scaled with the training parameters, so the same raw value would land on a different scaled value in the test set, and predictions degrade.
Instead, you want to only transform the test data by using the parameters computed on the training data.
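Putting the pieces together, the workflow might look like the sketch below. The data is synthetic (the rng-generated X and y are stand-ins, not the asker's data), and the last lines show the discouraged variant only for comparison:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn parameters from train, scale train
X_test_scaled = scaler.transform(X_test)        # reuse the training parameters on test

# Discouraged: refitting on the test set uses different parameters
X_test_refit = StandardScaler().fit_transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_test_scaled.mean(axis=0))   # generally not exactly 0, and that is fine
print(np.allclose(X_test_scaled, X_test_refit))
```

The test set scaled with the training parameters does not need to have zero mean and unit variance. What matters is that it passes through the same mapping the model saw during training.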