How to apply standardization to train and test datasets

Let's say I have a 10-feature dataset X of shape [100, 10] and a target dataset y of shape [100, 1]. After splitting the two with sklearn.model_selection.train_test_split, I obtained:

  • X_train: [70, 10]
  • X_test: [30, 10]
  • y_train: [70, 1]
  • y_test: [30, 1]

What is the correct way to apply standardization?

I've tried with:

from sklearn import preprocessing
scaler = preprocessing.StandardScaler()

scaler.fit(X_train)

X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

but then, when I predict with a model and try to invert the scaling to look at the MAE, I get an error:

from sklearn import linear_model
lr = linear_model.LinearRegression()
lr.fit(X_train_std, y_train)
y_pred_std = lr.predict(X_test_std)

y_pred = scaler.inverse_transform(y_pred_std) # error here


I also have another question. Since I have the target values, should I use

scaler = preprocessing.StandardScaler()

X_train_std = scaler.fit_transform(X_train, y_train)
X_test_std = scaler.transform(X_test)

instead of the first code block?


Do I also have to apply the transformation to the y_train and y_test datasets? I am a bit confused.

Facosenpai asked Sep 05 '25 14:09


1 Answer

StandardScaler is supposed to be used on the feature matrix X only.

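So the fit, transform and inverse_transform methods only ever need your X. The scaler in the question was fitted on the 10 columns of X_train, which is why calling inverse_transform on the single-column predictions raises an error. Because y_train was never scaled, the output of lr.predict is already in the original units of y and there is nothing to invert.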
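As a rough sketch of that workflow, the snippet below scales only the features and computes the MAE directly on the unscaled predictions. The synthetic data, the random seeds and the noise level are assumptions made for illustration, not the asker's dataset. On the second question: StandardScaler accepts a y argument in fit and fit_transform only for API consistency and ignores it, so passing y_train there changes nothing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's data (assumption): 100 samples, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 10))
y = X @ rng.normal(size=(10, 1)) + rng.normal(scale=0.1, size=(100, 1))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean_/scale_ from training features only
X_test_std = scaler.transform(X_test)        # reuse the training statistics on the test set

lr = LinearRegression()
lr.fit(X_train_std, y_train)     # y_train is left unscaled
y_pred = lr.predict(X_test_std)  # predictions are already in the original units of y

# No inverse_transform is needed before computing the error.
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.4f}")
```

If the target itself ever needs scaling, one option is a second, separate scaler fitted on y_train, so that each scaler's inverse_transform sees the same number of columns it was fitted on.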

Note that after you fit the scaler, you can access the following attributes:

  1. mean_: mean of each feature in X_train
  2. scale_: standard deviation of each feature in X_train

The transform method computes (X[i, col] - mean_[col]) / scale_[col] for each sample i, whereas the inverse_transform method computes X[i, col] * scale_[col] + mean_[col] for each sample i.
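To check these two formulas against the library, here is a small sketch that reproduces transform and inverse_transform by hand from mean_ and scale_. The random X_train is an assumed placeholder with the same shape as in the question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder training features (assumption): 70 samples, 10 features.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(70, 10))

scaler = StandardScaler().fit(X_train)

# Manual standardization: subtract the per-column mean, then divide by the per-column scale.
manual_std = (X_train - scaler.mean_) / scaler.scale_

# Manual inverse: multiply by the per-column scale, then add the per-column mean back.
manual_back = manual_std * scaler.scale_ + scaler.mean_

matches_transform = np.allclose(manual_std, scaler.transform(X_train))
matches_inverse = np.allclose(manual_back, scaler.inverse_transform(manual_std))
round_trip = np.allclose(manual_back, X_train)

print(matches_transform, matches_inverse, round_trip)
```

Note that scale_ is the population standard deviation (ddof=0), so it matches X_train.std(axis=0) rather than the sample standard deviation.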

Jan K answered Sep 08 '25 12:09