I want to separate my data into training and test sets. Should I apply normalization to the data before or after the split? Does it make any difference when building a predictive model?
Should You Normalize and Encode Data Before Train-Test Splitting, or After Splitting?
In short: split the data into training and test sets first, then normalize and encode, fitting the transformation on the training set and applying it to both.
Normalization should be fitted on the training set, and the same scaling should then be applied to the test data. That means storing the scale and offset used with our training data and reusing them. A common beginner mistake is to normalize the train and test data separately.
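As a minimal sketch of what "storing the scale and offset" looks like in plain NumPy (the arrays here are hypothetical, chosen to match the worked example further down):

import numpy as np

# Hypothetical feature matrices, already split into train and test.
X_train = np.array([[4.0, 5.0], [0.0, 1.0], [6.0, 7.0]])
X_test = np.array([[2.0, 3.0], [8.0, 9.0]])

# Learn the offset (mean) and scale (standard deviation) from the training set only.
offset = X_train.mean(axis=0)
scale = X_train.std(axis=0)

# Reuse the stored training statistics on both sets.
X_train_scaled = (X_train - offset) / scale
X_test_scaled = (X_test - offset) / scale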
Normalization is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.
Here is the answer: you should NEVER do anything that leaks information about your test data into the training process. If you normalize before the split, the range or distribution you compute is based partly on the test data, so information about the test set leaks into your training data.
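To make the leak concrete, here is a small sketch with made-up numbers, scaling a feature to a [0, 1] range the wrong way and the right way:

import numpy as np

# Made-up 1-D feature; the last two values will serve as the "test" portion.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
x_train, x_test = x[:3], x[3:]

# Wrong: scaling before the split lets the test outlier (100.0)
# define the range used to scale the training points.
leaky = (x - x.min()) / (x.max() - x.min())

# Right: the range comes from the training portion only.
lo, hi = x_train.min(), x_train.max()
x_train_scaled = (x_train - lo) / (hi - lo)
x_test_scaled = (x_test - lo) / (hi - lo)  # may fall outside [0, 1]; that is expected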
You first need to split the data into training and test sets (a validation set can be useful too).
Don't forget that the test data points represent real-world data. Feature normalization (or data standardization) of the explanatory (or predictor) variables is a technique used to center and scale the data by subtracting the mean and dividing by the standard deviation. If you take the mean and standard deviation of the whole dataset, you'll be introducing future information into the training explanatory variables (i.e. the mean and standard deviation).
Therefore, you should perform feature normalization over the training data. Then perform normalization on the test instances as well, but this time using the mean and standard deviation of the training explanatory variables. In this way, we can test and evaluate whether our model generalizes well to new, unseen data points.
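In scikit-learn, this is exactly the fit-on-train, transform-on-test pattern of StandardScaler. A minimal sketch (the split itself is walked through step by step below):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = np.arange(10).reshape((5, 2)), list(range(5))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # mean and std learned from training data only
X_test_std = scaler.transform(X_test)        # the same training statistics are reused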
For a more comprehensive read, see my article Feature Scaling and Normalisation in a nutshell.
As an example, assuming we have the following data:
>>> import numpy as np
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
where X represents our features:
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
and y contains the corresponding labels:
>>> list(y)
[0, 1, 2, 3, 4]
Step 1: Create training/testing sets
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_train
[2, 0, 3]
>>> y_test
[1, 4]
Step 2: Normalize training data
>>> from sklearn import preprocessing
>>> normalizer = preprocessing.Normalizer()
>>> normalized_train_X = normalizer.fit_transform(X_train)
>>> normalized_train_X
array([[0.62469505, 0.78086881],
       [0.        , 1.        ],
       [0.65079137, 0.7592566 ]])
Step 3: Normalize testing data
>>> normalized_test_X = normalizer.transform(X_test)
>>> normalized_test_X
array([[0.5547002 , 0.83205029],
       [0.66436384, 0.74740932]])
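Note that Normalizer is a somewhat special case here: it rescales each sample to unit norm independently, so its fit step learns nothing from the training data. For scalers that do learn training statistics, such as StandardScaler sketched earlier, this same fit-on-train, transform-on-test pattern is what actually prevents leakage.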
Alternatively, you can fit the normalizer on the training data first:

normalizer = preprocessing.Normalizer().fit(X_train)

and then transform both sets:

X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)
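Finally, if you want this fit-on-train, transform-on-test discipline handled for you, scikit-learn's Pipeline bundles the scaler and the model together. A minimal sketch, reusing the split from above and assuming a k-nearest neighbors classifier (one of the algorithms mentioned earlier that benefits from scaling):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# fit() computes the scaling statistics from X_train only;
# score() reuses those statistics when transforming X_test.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))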