I want to separate my data into training and test sets. Should I apply normalization to the data before or after the split? Does it make any difference when building a predictive model?
Should You Normalize and Encode Data Before Train-Test Splitting, or After Splitting?
In short: split the data into training and test sets first, then normalize and encode, fitting the transformation on the training set and applying it to both.
Normalization should be fitted on the training set, and the same scaling should then be applied to the test data. That means storing the scale and offset used with our training data and reusing them. A common beginner mistake is to normalize the train and test data separately.
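As a minimal sketch of what "storing the scale and offset" looks like in plain NumPy (the arrays here are hypothetical, chosen to match the worked example further down):

import numpy as np

# Hypothetical feature matrices, already split into train and test.
X_train = np.array([[4.0, 5.0], [0.0, 1.0], [6.0, 7.0]])
X_test = np.array([[2.0, 3.0], [8.0, 9.0]])

# Learn the offset (mean) and scale (standard deviation) from the training set only.
offset = X_train.mean(axis=0)
scale = X_train.std(axis=0)

# Reuse the stored training statistics on both sets.
X_train_scaled = (X_train - offset) / scale
X_test_scaled = (X_test - offset) / scale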
Normalization is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.
Here is the answer: you should NEVER do anything that leaks information about your test data into the training process. If you normalize before the split, the range or distribution you compute is based partly on the test data, so information about the test set leaks into your training data.
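To make the leak concrete, here is a small sketch with made-up numbers, scaling a feature to a [0, 1] range the wrong way and the right way:

import numpy as np

# Made-up 1-D feature; the last two values will serve as the "test" portion.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
x_train, x_test = x[:3], x[3:]

# Wrong: scaling before the split lets the test outlier (100.0)
# define the range used to scale the training points.
leaky = (x - x.min()) / (x.max() - x.min())

# Right: the range comes from the training portion only.
lo, hi = x_train.min(), x_train.max()
x_train_scaled = (x_train - lo) / (hi - lo)
x_test_scaled = (x_test - lo) / (hi - lo)  # may fall outside [0, 1]; that is expected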
You first need to split the data into training and test sets (a validation set can be useful too).
Don't forget that the test data points represent real-world data. Feature normalization (or data standardization) of the explanatory (or predictor) variables is a technique used to center and scale the data by subtracting the mean and dividing by the standard deviation. If you take the mean and standard deviation of the whole dataset, you'll be introducing future information into the training explanatory variables (i.e. the mean and standard deviation).
Therefore, you should perform feature normalization over the training data. Then perform normalization on the test instances as well, but this time using the mean and standard deviation of the training explanatory variables. In this way, we can test and evaluate whether our model generalizes well to new, unseen data points.
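In scikit-learn, this is exactly the fit-on-train, transform-on-test pattern of StandardScaler. A minimal sketch (the split itself is walked through step by step below):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = np.arange(10).reshape((5, 2)), list(range(5))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # mean and std learned from training data only
X_test_std = scaler.transform(X_test)        # the same training statistics are reused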
For a more comprehensive read, see my article Feature Scaling and Normalisation in a nutshell.
As an example, assuming we have the following data:
>>> import numpy as np
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
where X represents our features:
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
and y contains the corresponding labels:
>>> list(y)
[0, 1, 2, 3, 4]
Step 1: Create training/testing sets
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_train
[2, 0, 3]
>>> y_test
[1, 4]
Step 2: Normalize training data
>>> from sklearn import preprocessing
>>> normalizer = preprocessing.Normalizer()
>>> normalized_train_X = normalizer.fit_transform(X_train)
>>> normalized_train_X
array([[0.62469505, 0.78086881],
       [0.        , 1.        ],
       [0.65079137, 0.7592566 ]])
Step 3: Normalize testing data
>>> normalized_test_X = normalizer.transform(X_test)
>>> normalized_test_X
array([[0.5547002 , 0.83205029],
       [0.66436384, 0.74740932]])
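Note that Normalizer is a somewhat special case here: it rescales each sample to unit norm independently, so its fit step learns nothing from the training data. For scalers that do learn training statistics, such as StandardScaler sketched earlier, this same fit-on-train, transform-on-test pattern is what actually prevents leakage.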
Alternatively, you can fit the normalizer on the training data first:

normalizer = preprocessing.Normalizer().fit(X_train)

and then transform both sets:

X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)
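Finally, if you want this fit-on-train, transform-on-test discipline handled for you, scikit-learn's Pipeline bundles the scaler and the model together. A minimal sketch, reusing the split from above and assuming a k-nearest neighbors classifier (one of the algorithms mentioned earlier that benefits from scaling):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# fit() computes the scaling statistics from X_train only;
# score() reuses those statistics when transforming X_test.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))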