 

Normalize data before or after split of training and testing data?

I want to separate my data into train and test set, should I apply normalization over data before or after the split? Does it make any difference while building predictive model?

asked Mar 23 '18 07:03 by hemant

People also ask

Should I normalize before or after test train split?

Should you normalize and encode data before train-test splitting, or after splitting? Split first: fit the normalization and encoding on the training set only, then apply that same transformation to the test set.

Should we normalize data before training?

Normalization should be fitted on the training set, and the same scaling should then be applied to the test data. That means storing the scale and offset learned from the training data and reusing them. A common beginner mistake is to normalize the train and test data separately.
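The "store the scale and offset" idea can be sketched with scikit-learn's StandardScaler, which remembers the training mean and scale and reuses them on test data (the toy arrays here are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)  # learns mean_ and scale_ from training data only

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the stored mean and scale
```

Because the scaler was fitted on the training set alone, the test point is expressed in the training data's units, exactly as new real-world data would be at prediction time.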

When should you normalize your data?

Normalization is useful when your features are on different scales and the algorithm you are using makes no assumptions about the distribution of your data, as with k-nearest neighbors and artificial neural networks.
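As a sketch of this advice, a scikit-learn Pipeline ties the scaling step to whatever fold the k-nearest-neighbors model is trained on, so the test data never influences the learned scale (the iris dataset is just a stand-in):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training fold only,
# then applies the same transform to test data inside score().
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
score = model.score(X_te, y_te)
```

Wrapping the scaler in the pipeline also keeps cross-validation honest: each CV fold refits the scaler on its own training portion.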

Why it is wrong to normalize all the data first and then split it into a training set and a test set?

Here is the answer: you should NEVER do anything that leaks information about your testing data BEFORE the split. If you normalize before the split, you will use the testing data to calculate the range or distribution, which leaks information about the test set into the training process.
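The leak can be made concrete with min-max scaling in plain NumPy: if the range is computed on the whole dataset, an outlier that belongs to the test set distorts the training values (a minimal sketch with made-up numbers):

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 100.0])  # the last point will be test data
train, test = data[:3], data[3:]

# Wrong: range computed on ALL data, so the unseen test outlier (100) leaks in
leaky_train = (train - data.min()) / (data.max() - data.min())

# Right: range computed on the training data only
clean_train = (train - train.min()) / (train.max() - train.min())

print(leaky_train)  # training values squashed near 0 by the test outlier
print(clean_train)  # training values span the full [0, 1] range
```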


2 Answers

You first need to split the data into training and test set (validation set could be useful too).

Don't forget that testing data points represent real-world data. Feature normalization (or data standardization) of the explanatory (or predictor) variables is a technique used to center and scale the data by subtracting the mean and dividing by the standard deviation. If you take the mean and standard deviation of the whole dataset, you'll be introducing information from the test set into the training explanatory variables (i.e. into the mean and standard deviation used for scaling).

Therefore, you should perform feature normalisation over the training data. Then perform normalisation on the testing instances as well, but this time using the mean and standard deviation of the training explanatory variables. This way, we can test and evaluate whether our model generalizes well to new, unseen data points.

For a more comprehensive read, see my article Feature Scaling and Normalisation in a nutshell.


As an example, assuming we have the following data:

>>> import numpy as np
>>>
>>> X, y = np.arange(10).reshape((5, 2)), range(5)

where X represents our features:

>>> X
[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]

and y contains the corresponding labels:

>>> list(y)
[0, 1, 2, 3, 4]

Step 1: Create training/testing sets

>>> from sklearn.model_selection import train_test_split
>>>
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
>>> X_train
[[4 5]
 [0 1]
 [6 7]]
>>> X_test
[[2 3]
 [8 9]]
>>> y_train
[2, 0, 3]
>>> y_test
[1, 4]

Step 2: Normalise training data

>>> from sklearn import preprocessing
>>>
>>> normalizer = preprocessing.Normalizer()
>>> normalized_train_X = normalizer.fit_transform(X_train)
>>> normalized_train_X
array([[0.62469505, 0.78086881],
       [0.        , 1.        ],
       [0.65079137, 0.7592566 ]])

Step 3: Normalize testing data

>>> normalized_test_X = normalizer.transform(X_test)
>>> normalized_test_X
array([[0.5547002 , 0.83205029],
       [0.66436384, 0.74740932]])
answered Sep 23 '22 16:09 by Giorgos Myrianthous


You can use fit to learn the normalization parameters from the training data:

normalizer = preprocessing.Normalizer().fit(xtrain) 

then transform both sets with those same parameters:

xtrainnorm = normalizer.transform(xtrain)
xtestnorm = normalizer.transform(xtest)
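Putting the two steps together, a self-contained version of this pattern might look like the following (the variable names follow the answer; the toy data mirrors the first answer's example):

```python
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), list(range(5))
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# fit: learn the normalization from the training data only
normalizer = preprocessing.Normalizer().fit(xtrain)

# transform: apply the fitted normalizer to both sets
xtrainnorm = normalizer.transform(xtrain)
xtestnorm = normalizer.transform(xtest)
```

Note that Normalizer rescales each row to unit length independently, so fitting is a no-op for this particular transformer; the fit/transform split still matters for stateful scalers such as StandardScaler or MinMaxScaler.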
answered Sep 21 '22 16:09 by user3452134