
fit-transform on training data and transform on test data [duplicate]

I am having trouble understanding how exactly transform() and fit_transform() are working together.

I call fit_transform() on my training data set and transform() on my test set afterwards.

However if I call fit_transform() on the test set I get bad results.

Can anybody give me an explanation how and why this occurs?

b4shyou asked Feb 08 '18 18:02


People also ask

What is difference between transform and fit transform?

The fit(data) method computes the mean and standard deviation of each feature, which are then used for scaling. The transform(data) method performs the scaling using the mean and standard deviation calculated by .fit(). The fit_transform() method does both: it fits and then transforms.
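
As a rough sketch of that relationship (using StandardScaler and a toy array chosen purely for illustration), fit() followed by transform() gives the same result as fit_transform():

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])  # toy data for illustration only

# fit() learns the mean and std dev, transform() applies them
scaler = StandardScaler()
scaler.fit(X)
a = scaler.transform(X)

# fit_transform() does both steps in one call
b = StandardScaler().fit_transform(X)

print(np.allclose(a, b))  # True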

Why we use Fit_transform () on training data but transform () on the test data?

fit_transform() is used on the training data so that we can scale the training data and also learn the scaling parameters of that data. Here, the model built by us will learn the mean and variance of the features of the training set. These learned parameters are then used to scale our test data.

What is the difference between fit Fit_transform and predict methods?

Remember that fit_transform() is applied only to the training data, while transform() and predict() act on the test data.
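
A minimal sketch of that split, assuming a StandardScaler plus a LogisticRegression classifier as a stand-in for whatever model you actually use (the random data is just a placeholder):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# placeholder data so the snippet runs
X = np.random.rand(200, 4)
y = np.random.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform on training data
X_test_scaled = scaler.transform(X_test)        # transform only on test data

model = LogisticRegression().fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)           # predict on test data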

Why do we transform test data?

The reason is that we want to pretend that the test data is “new, unseen data.” We use the test dataset to get a good estimate of how our model performs on any new data. Now, in a real application, the new, unseen data could be just 1 data point that we want to classify.


1 Answer

Let's take an example of a transform, sklearn.preprocessing.StandardScaler.

From the docs, this will:

Standardize features by removing the mean and scaling to unit variance

Suppose you're working with code like the following.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X is features, y is label (random placeholder data here, just so the
# snippet runs; substitute your own dataset)
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

When you call StandardScaler.fit(X_train), it calculates the mean and standard deviation of each feature in X_train. Calling .transform() then standardizes the features by subtracting that mean and dividing by that standard deviation. For convenience, the two steps can be done in one call with fit_transform().
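
For example, continuing the snippet above (the exact numbers depend on your data), the learned statistics are exposed as the mean_ and scale_ attributes, and transform() simply applies them:

scaler = StandardScaler().fit(X_train)

print(scaler.mean_)   # per-feature mean learned from X_train
print(scaler.scale_)  # per-feature standard deviation learned from X_train

# transform() applies exactly these statistics
manual = (X_train - scaler.mean_) / scaler.scale_
print(np.allclose(manual, scaler.transform(X_train)))  # True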

The reason you want to fit the scaler using only the training data is that you don't want to bias your model with information from the test data.

If you call fit() on your test data, you compute a new mean and standard deviation for each feature. In theory these values may be very similar if your test and train sets come from the same distribution, but in practice this is typically not the case.

Instead, you want to only transform the test data by using the parameters computed on the training data.
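
Continuing the same snippet, the two approaches look roughly like this; only the first pair of lines is what you want:

scaler = StandardScaler()

# correct: the scaling statistics come from the training data only
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# incorrect: this re-fits on the test data, so the test set is scaled with
# its own mean and std dev instead of the training statistics
X_test_leaky = StandardScaler().fit_transform(X_test)

print(np.allclose(X_test_scaled, X_test_leaky))  # almost always False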

pault answered Oct 26 '22 06:10