
How to scale train, validation and test sets properly using StandardScaler?

Some articles say that when you have only train and test sets, you should first call fit_transform() on the training set and then only transform() on the test set, in order to prevent data leakage.

In my case, I also have a validation set.

I think one of the two snippets below should be correct, but I am not sure which one to rely on. Any kind of help will be appreciated, thanks!

1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 2/7)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

2)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 2/7)
X_test = scaler.transform(X_test)
bbasaran asked Nov 12 '19 16:11





1 Answer

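Generally, you would want to use Option 1. Calling fit on the training data computes the per-feature mean and standard deviation of the training set and stores them in the scaler; transform then uses those stored statistics to standardize whatever data it is given. In Option 2 the scaler is fitted before the validation split, so the validation rows contribute to the mean and standard deviation, which leaks validation information into the preprocessing.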
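To make the fit/transform distinction concrete, here is a minimal sketch on a tiny made-up array (the values and the names X_new, manual and scaled are only for illustration). It checks that transform applies the statistics stored by fit, which StandardScaler exposes as mean_ and scale_:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training data, invented for this illustration.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
# A new sample that was NOT seen during fit.
X_new = np.array([[4.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)  # stores per-feature mean_ and scale_ (standard deviation)

# transform() standardizes with the stored training statistics.
scaled = scaler.transform(X_new)
manual = (X_new - scaler.mean_) / scaler.scale_

print(scaler.mean_)                 # per-feature training means
print(np.allclose(scaled, manual))  # same result either way
```

The new sample is standardized with the training mean and standard deviation, not with statistics computed from the new sample itself.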

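If you call fit again on the validation or test set, that set is scaled with its own statistics instead of the training statistics, so the model sees inputs on a different scale from the one it was trained on and the evaluation is no longer a fair estimate.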
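As a rough end-to-end sketch of Option 1, the snippet below uses synthetic data in place of the asker's X and y (the random data, the random_state values and the *_s variable names are assumptions added here so the example runs on its own). The scaler is fitted once, on the training split only, and reused for the other two splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
y = rng.integers(0, 2, size=1000)

# Split first: 30% test, then 2/7 of the remainder as validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=2/7, random_state=0)

# Fit on the training split only, then reuse the same scaler everywhere.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

# Training features have mean 0 and std 1 by construction; validation and
# test features are only approximately standardized, which is expected.
print(X_train_s.mean(axis=0), X_train_s.std(axis=0))
print(X_val_s.mean(axis=0), X_test_s.mean(axis=0))
```

The validation and test means will be close to zero but not exactly zero, since those splits are scaled with the training statistics rather than their own.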

Infinite answered Nov 08 '22 10:11
