Let's say I have a 10-feature dataset X of shape [100, 10] and a target dataset y of shape [100, 1].
For example, after splitting the two with sklearn.model_selection.train_test_split I obtained:
X_train: [70, 10]
X_test: [30, 10]
y_train: [70, 1]
y_test: [30, 1]
What is the correct way to apply standardization?
I've tried with:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
but then, after predicting with a model, when I try to invert the scaling to look at the MAE, I get an error:
from sklearn import linear_model
lr = linear_model.LinearRegression()
lr.fit(X_train_std, y_train)
y_pred_std = lr.predict(X_test_std)
y_pred = scaler.inverse_transform(y_pred_std) # error here
I also have another question. Since I have the target values, should I use
scaler = preprocessing.StandardScaler()
X_train_std = scaler.fit_transform(X_train, y_train)
X_test_std = scaler.transform(X_test)
instead of the first code block?
Do I have to apply the transformation to the y_train and y_test datasets as well? I am a bit confused.
StandardScaler is supposed to be used on the feature matrix X only.
So all the fit, transform and inverse_transform methods just need your X.
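Here is a minimal sketch of the corrected workflow on synthetic data shaped like the question's (the random data and the exact linear target are assumptions for illustration): the scaler is fit on X_train only, y is left unscaled, and the predictions come out directly in the original target units, so no inverse_transform is needed to compute the MAE.

```python
import numpy as np
from sklearn import linear_model, preprocessing
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data matching the shapes in the question
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=(10, 1)) + 1.0  # noiseless linear target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the scaler on the training features only, then transform both splits
scaler = preprocessing.StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

lr = linear_model.LinearRegression()
lr.fit(X_train_std, y_train)
y_pred = lr.predict(X_test_std)  # already in the original units of y

mae = mean_absolute_error(y_test, y_pred)  # no inverse_transform needed
```

Because y was never scaled, the predictions are directly comparable to y_test; that is what makes the inverse_transform call in the question unnecessary (and, since the scaler was fit on 10 feature columns, invalid for a 1-column y_pred).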
Note that after you fit the model, you can access the following attributes:
mean_: mean of each feature in X_train
scale_: standard deviation of each feature in X_train
The transform method computes (X[i, col] - mean_[col]) / scale_[col] for each sample i, whereas the inverse_transform method computes X[i, col] * scale_[col] + mean_[col] for each sample i.
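A quick sketch checking those formulas against the fitted attributes (the toy 3x2 matrix is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
scaler = StandardScaler().fit(X)

# transform is (X - mean_) / scale_, applied column-wise
manual = (X - scaler.mean_) / scaler.scale_
assert np.allclose(manual, scaler.transform(X))

# inverse_transform is X_std * scale_ + mean_, so it recovers the original data
assert np.allclose(scaler.inverse_transform(scaler.transform(X)), X)
```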