I am working on data preprocessing and want to compare the benefits of Data Standardization vs Normalization vs Robust Scaler practically.
In theory, the guidelines are: Normalization is highly affected by outliers, while Standardization is only slightly affected. Normalization is considered when the algorithm makes no assumptions about the data distribution; Standardization is used when the algorithm does make such assumptions.
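For concreteness, the three transforms can be written out directly. This is a sketch with hypothetical helper names of my own; "Normalization" here means min-max scaling, as in scikit-learn's MinMaxScaler:

    import numpy as np

    def standardize(x):        # StandardScaler: (x - mean) / std
        return (x - x.mean()) / x.std()

    def min_max_normalize(x):  # MinMaxScaler: map onto [0, 1]
        return (x - x.min()) / (x.max() - x.min())

    def robust_scale(x):       # RobustScaler: (x - median) / IQR
        q1, q3 = np.percentile(x, [25, 75])
        return (x - np.median(x)) / (q3 - q1)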
I created 20 random numerical inputs and tried each of the above-mentioned methods (in my results table, the numbers in red represent the outliers).
I noticed that Normalization was indeed negatively affected by the outliers: the new values all collapsed to tiny, nearly identical numbers on the order of 0.000000x (six zeros after the decimal point), even though there are noticeable differences between the original inputs!
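A minimal sketch of the kind of experiment I ran (the input values and outlier magnitudes here are illustrative assumptions, not my original numbers):

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

    # 20 inputs: 17 moderate values plus 3 extreme outliers (made-up values)
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.uniform(10, 100, 17), [10_000, 15_000, 20_000]])
    x = x.reshape(-1, 1)  # scikit-learn scalers expect shape (n_samples, n_features)

    for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
        scaled = scaler.fit_transform(x)
        print(type(scaler).__name__)
        print(scaled[:5].ravel())  # first few non-outlier values after scaling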
For context, scikit-learn's documentation describes RobustScaler as follows:

"Scale features using statistics that are robust to outliers. This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile)."

My questions are:
Am I right to say that Standardization is also negatively affected by extreme values?
Indeed you are; the scikit-learn docs themselves clearly warn about such cases:
"However, when data contains outliers, StandardScaler can often be misled. In such cases, it is better to use a scaler that is robust against outliers."

More or less, the same holds true for the MinMaxScaler as well.
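You can verify this directly: a single extreme value drags both the mean and the standard deviation used by StandardScaler, and blows up the max used by MinMaxScaler. A toy example with made-up numbers, not your data:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x_out = np.append(x, 1_000.0)  # one extreme value

    # StandardScaler computes (x - mean) / std; both statistics shift badly:
    print(x.mean(), x.std())          # 3.0  ~1.41
    print(x_out.mean(), x_out.std())  # ~169.2  ~371.6

    # MinMaxScaler computes (x - min) / (max - min); the denominator explodes,
    # squashing all non-outlier values into a tiny interval near 0:
    print((x_out - x_out.min()) / (x_out.max() - x_out.min()))
    # -> [0.    0.001 0.002 0.003 0.004 1.   ]

This is exactly the "all values almost identical" effect you observed with Normalization.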
I really can't see how the Robust Scaler improved the data, because I still have extreme values in the resulting data set. Is there any simple, complete interpretation?
Robust does not mean immune or invulnerable, and the purpose of scaling is not to "remove" outliers and extreme values; that is a separate task with its own methodologies. This is again clearly mentioned in the relevant scikit-learn docs:
From the RobustScaler entry:

"[...] Note that the outliers themselves are still present in the transformed data. If a separate outlier clipping is desirable, a non-linear transformation is required (see below)."

where the "see below" refers to the QuantileTransformer and quantile_transform.
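A small sketch makes the distinction concrete (illustrative data again, not the original 20 inputs): RobustScaler merely re-centers by the median and re-scales by the IQR, so the outlier stays an outlier, whereas QuantileTransformer maps each value to its quantile in [0, 1], effectively reining in the extremes:

    import numpy as np
    from sklearn.preprocessing import RobustScaler, QuantileTransformer

    x = np.array([1, 2, 3, 4, 5, 1_000], dtype=float).reshape(-1, 1)

    # RobustScaler: (x - median) / IQR -- the outlier is rescaled, not removed
    print(RobustScaler().fit_transform(x).ravel())
    # -> [-1.  -0.6 -0.2  0.2  0.6  398.6]; 1000 is still an extreme value

    # QuantileTransformer maps each value to its quantile in [0, 1],
    # so the outlier simply becomes "the largest value" (1.0)
    qt = QuantileTransformer(n_quantiles=6, output_distribution='uniform')
    print(qt.fit_transform(x).ravel())
    # -> [0.  0.2 0.4 0.6 0.8 1. ]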