Can anyone explain StandardScaler to me?

Intro

I assume that you have a matrix X where each row is a sample/observation and each column is a variable/feature (this is the expected input for sklearn estimators and transformers, by the way: X.shape should be (number_of_samples, number_of_features)).
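As a small illustration of that shape convention (the values below are made up), a single feature stored as a 1-D array would need to be reshaped into one column before it is passed to a scaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature measured on 4 samples (1-D array).
ages = np.array([20.0, 30.0, 40.0, 50.0])

# scikit-learn expects 2-D input: (number_of_samples, number_of_features).
X = ages.reshape(-1, 1)
print(X.shape)  # (4, 1)

scaled = StandardScaler().fit_transform(X)
print(scaled.shape)  # (4, 1)
```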

Core of method

The main idea is to standardize the features/variables/columns of X individually, so that each one has μ = 0 and σ = 1, before applying any machine learning model.

StandardScaler() standardizes the features, i.e. each column of X, INDIVIDUALLY, so that each column/feature/variable has μ = 0 and σ = 1.

P.S.: I find the most upvoted answer on this page to be wrong. I am quoting "each value in the dataset will have the sample mean value subtracted". This is not correct: the mean is subtracted per feature (column), as described above.
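To make the per-column point concrete, here is a minimal sketch (the array values are made up) that compares StandardScaler with two manual computations: one using per-column statistics and one using a single mean and standard deviation over all values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values).
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# One mean and one std per column.
per_column = (X - X.mean(axis=0)) / X.std(axis=0)

# A single mean and std computed over all values at once.
global_stats = (X - X.mean()) / X.std()

sk = StandardScaler().fit_transform(X)

print(np.allclose(sk, per_column))    # True
print(np.allclose(sk, global_stats))  # False
```

Only the per-column version matches what the scaler returns.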

See also: How and why to Standardize your data: A python tutorial

Example with code

from sklearn.preprocessing import StandardScaler
import numpy as np

# 4 samples/observations and 2 variables/features
data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print(data)
[[0 0]
 [1 0]
 [0 1]
 [1 1]]

print(scaled_data)
[[-1. -1.]
 [ 1. -1.]
 [-1.  1.]
 [ 1.  1.]]

Verify that the mean of each feature (column) is 0:

scaled_data.mean(axis = 0)
array([0., 0.])

Verify that the std of each feature (column) is 1:

scaled_data.std(axis = 0)
array([1., 1.])

Appendix: The maths

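For each feature (column) j, every value x_ij is transformed as z_ij = (x_ij - μ_j) / σ_j, where μ_j is the mean of column j and σ_j is its standard deviation (the population standard deviation, i.e. dividing by N, which is what StandardScaler uses).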

UPDATE 08/2020: Concerning setting the input parameters with_mean and with_std to False/True, I have provided an answer here: StandardScaler difference between “with_std=False or True” and “with_mean=False or True”
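As a rough sketch of what those two parameters do (this is not taken from the linked answer, and the array values are made up), each one switches off one half of the transformation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# with_std=False: only center each column (subtract its mean).
centered = StandardScaler(with_std=False).fit_transform(X)

# with_mean=False: only divide each column by its std, no centering.
rescaled = StandardScaler(with_mean=False).fit_transform(X)

print(np.allclose(centered, X - X.mean(axis=0)))  # True
print(np.allclose(rescaled, X / X.std(axis=0)))   # True
```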

The idea behind StandardScaler is that it will transform your data such that its distribution has a mean of 0 and a standard deviation of 1.
In the case of multivariate data, this is done feature-wise (in other words, independently for each column of the data).
Given the distribution of the data, each value in the dataset will have the mean value subtracted, and is then divided by the standard deviation of the whole dataset (or of the feature, in the multivariate case).

How to calculate it:

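z = (x - μ) / σ

with mean μ = (1/N) Σ x_i and standard deviation σ = sqrt((1/N) Σ (x_i - μ)²), where both sums run over the N samples.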
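The calculation described above (subtract the mean, then divide by the standard deviation, column by column) can be reproduced by hand with NumPy and compared against the scaler. This is only a sketch that reuses the small 4x2 array from the earlier example on this page:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

mu = X.mean(axis=0)      # per-column mean
sigma = X.std(axis=0)    # per-column population std (ddof=0)
z_manual = (X - mu) / sigma

scaler = StandardScaler().fit(X)
z_sklearn = scaler.transform(X)

print(np.allclose(z_manual, z_sklearn))  # True
print(scaler.mean_, scaler.scale_)       # [0.5 0.5] [0.5 0.5]
```

After fitting, the learned statistics are available as the mean_ and scale_ attributes of the scaler.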

You can read more here:

http://sebastianraschka.com/Articles/2014_about_feature_scaling.html#standardization-and-min-max-scaling

StandardScaler performs the task of standardization. A dataset usually contains variables that are on different scales. For example, an Employee dataset may contain an AGE column with values in the range 20-70 and a SALARY column with values in the range 10000-80000.
Because these two columns are on different scales, they are standardized to a common scale when building a machine learning model.
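A minimal sketch of that situation, using a hypothetical four-row employee table (the numbers are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical employees: columns are AGE and SALARY.
employees = np.array([[25.0, 20000.0],
                      [35.0, 40000.0],
                      [45.0, 60000.0],
                      [55.0, 80000.0]])

scaled = StandardScaler().fit_transform(employees)

# After scaling, both columns have mean 0 and std 1.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

With these particular values both columns end up identical after scaling, because AGE and SALARY are evenly spaced in the same pattern; only their original units differed.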


