Linear Regression: Normalization vs. Standardization

I am using linear regression to predict data, but I am getting totally contrasting results when I normalize versus standardize the variables.

Normalization: x' = (x - x_min) / (x_max - x_min)

Z-score standardization: x' = (x - x_mean) / x_std
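For concreteness, here is a minimal numpy sketch of both transforms (toy values, not my real data):

```python
import numpy as np

x = np.array([12.0, 7.0, 3.0, 25.0, 9.0])  # toy feature values

# Min-max normalization: rescales to the range [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: shifts to mean 0, rescales to standard deviation 1
x_std = (x - x.mean()) / x.std()

print(x_norm)  # all values fall in [0, 1]
print(x_std)   # mean ~0, standard deviation ~1
```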

a) When should I normalize versus standardize?
b) How does normalization affect linear regression?
c) Is it okay if I don't normalize all of the attributes/labels in the linear regression?

Thanks, Santosh

Asked Aug 20 '15 by Santosh Kumar



2 Answers

Note that the results might not necessarily be so different. You might simply need different hyperparameters for the two options to give similar results.

The ideal thing is to test what works best for your problem. If you can't afford to do that, most algorithms will probably benefit from standardization more than from normalization.
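One way to run that test is to cross-validate a pipeline with each scaler (a sketch using scikit-learn on synthetic data; substitute your own X and y):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic regression data; real data will usually show bigger gaps between scalers
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

for scaler in (StandardScaler(), MinMaxScaler()):
    pipe = make_pipeline(scaler, SGDRegressor(max_iter=1000, random_state=0))
    scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
    print(type(scaler).__name__, scores.mean())
```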

See here for some examples of when one should be preferred over the other:

For example, in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is the Principal Component Analysis, where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance (depending on the question and if the PCA computes the components via the correlation matrix instead of the covariance matrix; but more about PCA in my previous article).

However, this doesn’t mean that Min-Max scaling is not useful at all! A popular application is image processing, where pixel intensities have to be normalized to fit within a certain range (i.e., 0 to 255 for the RGB color range). Also, typical neural network algorithms require data on a 0-1 scale.

One disadvantage of normalization over standardization is that it loses some information in the data, especially about outliers.

Also on the linked page, there is this picture:

[Image: plots of a standardized and a normalized data set]

As you can see, normalization clusters all the data very close together, which may not be what you want. It might cause algorithms such as gradient descent to take longer to converge to the same solution than they would on a standardized data set, or it might even make convergence impossible.
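Here is a toy sketch of that gradient descent effect, with made-up data containing two features on very different scales (the data and learning rates are illustrative, not a recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two informative features on wildly different scales
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)])
y = 3 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.1, 200)
y = y - y.mean()  # center the target so we can skip fitting an intercept

def mse_after_gd(X, y, lr, steps=1000):
    """Plain batch gradient descent on MSE; returns the final training MSE."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return np.mean((X @ w - y) ** 2)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# On the raw features the learning rate must be tiny to avoid divergence,
# so progress along the small-scale feature is glacial.
print("raw:         ", mse_after_gd(X, y, lr=1e-6))
print("standardized:", mse_after_gd(X_std, y, lr=0.1))
```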

"Normalizing variables" doesn't really make sense. The correct terminology is "normalizing / scaling the features". If you're going to normalize or scale one feature, you should do the same for the rest.

Answered Sep 21 '22 by IVlad


That makes sense because normalization and standardization do different things.

Normalization rescales your data to the range between 0 and 1.

Standardization transforms your data so that the resulting distribution has a mean of 0 and a standard deviation of 1.
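You can check both properties directly with scikit-learn's scalers (a minimal sketch on toy values):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[3.0], [7.0], [9.0], [12.0], [25.0]])  # one feature as a column

x_norm = MinMaxScaler().fit_transform(x)
x_std = StandardScaler().fit_transform(x)

print(x_norm.min(), x_norm.max())  # 0.0 1.0
print(x_std.mean(), x_std.std())   # ~0.0, 1.0
```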

Normalization and standardization are designed to achieve a similar goal: creating features that have similar ranges to each other. We want that so we can be sure we are capturing the true information in a feature, and so we don't overweight a particular feature just because its values are much larger than those of other features.

If all of your features are within a similar range of each other, then there's no real need to standardize/normalize. If, however, some features naturally take on values that are much larger or smaller than others, then normalization/standardization is called for.

If you're going to normalize at least one variable/feature, I would do the same thing to all of the others as well.

Answered Sep 19 '22 by Simon