
Data Standardization vs Normalization vs Robust Scaler


I am working on data preprocessing and want to compare the benefits of Data Standardization vs. Normalization vs. Robust Scaler in practice.

In theory, the guidelines are:

Advantages:

  1. Standardization: scales features such that the distribution is centered around 0, with a standard deviation of 1.
  2. Normalization: shrinks the range such that the range is now between 0 and 1 (or -1 to 1 if there are negative values).
  3. Robust Scaler: similar to normalization but it instead uses the interquartile range, so that it is robust to outliers.

Disadvantages:

  1. Standardization: not good if the data is not normally distributed (i.e. not a Gaussian distribution).
  2. Normalization: gets influenced heavily by outliers (i.e. extreme values).
  3. Robust Scaler: doesn't take the mean into account and only focuses on the part where the bulk of the data is (the exact transforms are sketched below).
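
For reference, here is how the three transforms can be written out directly - a minimal numpy sketch (the toy array x is made up for illustration, not the data from the question):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 plays the outlier

    # Standardization (z-score): center on the mean, scale by the standard deviation
    standardized = (x - x.mean()) / x.std()

    # Normalization (min-max): map the observed range onto [0, 1]
    normalized = (x - x.min()) / (x.max() - x.min())

    # Robust scaling: center on the median, scale by the interquartile range (IQR)
    q1, q3 = np.percentile(x, [25, 75])
    robust = (x - np.median(x)) / (q3 - q1)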

I created 20 random numerical inputs and tried the above-mentioned methods (in the figure below, the numbers in red represent the outliers):

[Figure: Methods Comparison - the original inputs alongside their standardized, normalized, and robust-scaled values]
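
For anyone who wants to reproduce such a comparison, here is a minimal scikit-learn sketch (the 20 inputs below are made up, since the original values are not shown):

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

    # 20 inputs: a bulk of ordinary values plus three extreme outliers
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.uniform(1, 10, 17),
                        [1000.0, 5000.0, 10000.0]]).reshape(-1, 1)

    for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
        scaled = scaler.fit_transform(x)  # scikit-learn expects a 2-D array
        print(type(scaler).__name__, scaled.ravel().round(6))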


I noticed that, indeed, normalization was affected negatively by the outliers: the scale between the new values became tiny (all values almost identical up to six digits after the decimal point, i.e. 0.000000x) even though there are noticeable differences between the original inputs!

My questions are:

  1. Am I right to say that Standardization also gets affected negatively by the extreme values? If not, why, according to the results provided?
  2. I really can't see how the Robust Scaler improved the data, because I still have extreme values in the resulting data set. Is there a simple, complete interpretation?
Mike asked Aug 14 '18 12:08

People also ask

Does robust scaler normalize?

The robust scaler from scikit-learn normalizes the data by subtracting the median and dividing by the interquartile range, which makes it robust to outliers, unlike the standard scaler.

What is the difference between data Normalization and Standardization?

Normalization is highly affected by outliers. Standardization is slightly affected by outliers. Normalization is considered when the algorithms do not make assumptions about the data distribution. Standardization is used when algorithms make assumptions about the data distribution.

What is robust scaler used for?

Scale features using statistics that are robust to outliers. This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).
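
A hand-worked illustration of that mechanism (toy numbers, not taken from the question):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 1000], dtype=float)  # 1000 is the outlier

    median = np.median(x)                # 3.5  - unaffected by the outlier
    q1, q3 = np.percentile(x, [25, 75])  # 2.25 and 4.75
    iqr = q3 - q1                        # 2.5  - also unaffected

    robust = (x - median) / iqr
    # The bulk keeps a sensible scale; the outlier is still present, just large.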


1 Answer

Am I right to say that also Standardization gets affected negatively by the extreme values as well?

Indeed you are; the scikit-learn docs themselves clearly warn about such a case:

However, when data contains outliers, StandardScaler can often be misled. In such cases, it is better to use a scaler that is robust against outliers.

More or less, the same holds true for the MinMaxScaler as well.
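
A quick numeric illustration of how a single outlier misleads both scalers (the values are made up for the example):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    x_out = np.append(x, 1000.0)  # the same data plus one outlier

    # The outlier drags the mean and inflates the standard deviation,
    # so the z-scores of the ordinary points collapse toward each other:
    print((x - x.mean()) / x.std())              # well spread out
    print((x_out - x_out.mean()) / x_out.std())  # bulk compressed together

    # MinMaxScaler suffers the same way: the outlier stretches max - min,
    # squeezing the bulk into a tiny sub-interval of [0, 1]:
    print((x_out - x_out.min()) / (x_out.max() - x_out.min()))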

I really can't see how the Robust Scaler improved the data, because I still have extreme values in the resulting data set. Is there a simple, complete interpretation?

Robust does not mean immune or invulnerable, and the purpose of scaling is not to "remove" outliers and extreme values - that is a separate task with its own methodologies; this is again clearly mentioned in the relevant scikit-learn docs:

RobustScaler

[...] Note that the outliers themselves are still present in the transformed data. If a separate outlier clipping is desirable, a non-linear transformation is required (see below).

where the "see below" refers to the QuantileTransformer and quantile_transform.

desertnaut answered Sep 29 '22 01:09