Let's say I have a 10-feature dataset X of shape [100, 10] and a target dataset y of shape [100, 1].
For example, after splitting the two with sklearn.model_selection.train_test_split I obtained:
X_train: [70, 10]
X_test: [30, 10]
y_train: [70, 1]
y_test: [30, 1]
What is the correct way to apply standardization?
I've tried with:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
but then, after fitting a model and predicting, when I try to invert the scaling to look at the MAE, I get an error:
from sklearn import linear_model
lr = linear_model.LinearRegression()
lr.fit(X_train_std, y_train)
y_pred_std = lr.predict(X_test_std)
y_pred = scaler.inverse_transform(y_pred_std) # error here
I also have another question. Since I have the target values, should I use
scaler = preprocessing.StandardScaler()
X_train_std = scaler.fit_transform(X_train, y_train)
X_test_std = scaler.transform(X_test)
instead of the first code block?
Do I also have to apply the transformation to the y_train and y_test datasets? I am a bit confused.
StandardScaler is supposed to be used on the feature matrix X only, so its fit, transform and inverse_transform methods all just need your X. The y_train you pass in fit_transform(X_train, y_train) is accepted only for API compatibility with sklearn pipelines and is ignored, so your two code blocks do exactly the same thing. The error in your first snippet comes from calling inverse_transform on predictions with 1 column while the scaler was fitted on 10 features; and since y_train was never scaled, the predictions are already in the original units, so no inverse transform is needed at all.
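Putting this together, here is a minimal sketch of the corrected workflow (the data below is synthetic, only the shapes match your question): the scaler is fitted on X_train alone, y is never scaled, so the predictions already come out in the original units and no inverse_transform is needed before computing the MAE.

import numpy as np
from sklearn import linear_model, preprocessing
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # synthetic stand-in for your features
y = rng.normal(size=(100, 1))   # synthetic stand-in for your target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

scaler = preprocessing.StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean_/scale_ from the training set only
X_test_std = scaler.transform(X_test)        # reuse the training statistics

lr = linear_model.LinearRegression()
lr.fit(X_train_std, y_train)     # y_train stays in its original units
y_pred = lr.predict(X_test_std)  # so the predictions do too

print(mean_absolute_error(y_test, y_pred))  # no inverse_transform needed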
Note that after you fit the scaler, you can access the following attributes:
mean_: the mean of each feature in X_train
scale_: the standard deviation of each feature in X_train
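For instance, reusing the scaler and X_train from the sketch above, both attributes match the per-column statistics of the training set (note that scale_ is the population standard deviation, ddof=0, not the sample one):

import numpy as np

assert np.allclose(scaler.mean_, X_train.mean(axis=0))   # per-feature means
assert np.allclose(scaler.scale_, X_train.std(axis=0))   # per-feature std, ddof=0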
The transform method computes (X[i, col] - mean_[col]) / scale_[col] for each sample i, whereas the inverse_transform method computes X[i, col] * scale_[col] + mean_[col].
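As a quick sanity check of both formulas, again reusing the fitted scaler from the sketch above:

# transform standardizes each column using the fitted statistics...
assert np.allclose(scaler.transform(X_train),
                   (X_train - scaler.mean_) / scaler.scale_)
# ...and inverse_transform undoes it exactly.
assert np.allclose(scaler.inverse_transform(scaler.transform(X_train)), X_train)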