
sklearn standardscaler result different to manual result

I used the sklearn StandardScaler (mean removal and variance scaling) to scale a dataframe and compared it to a dataframe where I "manually" subtracted the mean and divided by the standard deviation. The comparison shows consistent small differences. Can anybody explain why? The dataset I used is this one: http://archive.ics.uci.edu/ml/datasets/Wine

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("~/DataSets/WineDataSetItaly/wine.data.txt", names=["Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"])

cols = list(df.columns)[1:]    # I didn't want to scale the "Class" column
std_scal = StandardScaler()
standardized = std_scal.fit_transform(df[cols])
df_standardized_fit = pd.DataFrame(standardized, index=df.index, columns=df.columns[1:])

df_standardized_manual = (df - df.mean()) / df.std()
df_standardized_manual.drop("Class", axis=1, inplace=True)

df_differences = df_standardized_fit - df_standardized_manual
df_differences.iloc[:,:5]


    Alcohol    Malic acid   Ash         Alcalinity of ash   Magnesium
0   0.004272    -0.001582   0.000653    -0.003290   0.005384
1   0.000693    -0.001405   -0.002329   -0.007007   0.000051
2   0.000554    0.000060    0.003120    -0.000756   0.000249
3   0.004758    -0.000976   0.001373    -0.002276   0.002619
4   0.000832    0.000640    0.005177    0.001271    0.003606
5   0.004168    -0.001455   0.000858    -0.003628   0.002421
asked May 27 '17 by Dirk Schulz



1 Answer

scikit-learn uses np.std, which by default computes the population standard deviation (the sum of squared deviations divided by the number of observations, N), whereas pandas computes the sample standard deviation (denominator N - 1; see Wikipedia's standard deviation article). The N - 1 is Bessel's correction, which gives an unbiased estimate of the population variance, and it is controlled by the degrees-of-freedom parameter (ddof). So by default, numpy's and scikit-learn's calculations use ddof=0, whereas pandas uses ddof=1 (docs):

DataFrame.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument
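
To see the two conventions side by side, here is a minimal sketch with made-up numbers (not the wine data), comparing numpy's default to pandas':

import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0])
s = pd.Series(x)

print(np.std(x))       # population std (ddof=0), the convention StandardScaler follows
print(s.std())         # sample std (ddof=1), the pandas default -- slightly larger
print(s.std(ddof=0))   # same value as np.std(x)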

If you change your pandas calculation to:

df_standardized_manual = (df - df.mean()) / df.std(ddof=0)

The differences will be practically zero:

        Alcohol    Malic acid           Ash  Alcalinity of ash     Magnesium
0 -8.215650e-15 -5.551115e-16  3.191891e-15       0.000000e+00  2.220446e-16
1 -8.715251e-15 -4.996004e-16  3.441691e-15       0.000000e+00  0.000000e+00
2 -8.715251e-15 -3.955170e-16  2.886580e-15      -5.551115e-17  1.387779e-17
3 -8.437695e-15 -4.440892e-16  3.164136e-15      -1.110223e-16  1.110223e-16
4 -8.659740e-15 -3.330669e-16  2.886580e-15       5.551115e-17  2.220446e-16
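
Alternatively, leave the pandas default alone and check the result against the parameters the scaler stored after fitting (mean_ and scale_). A hedged sketch, assuming the variables df, cols, std_scal and df_standardized_fit from the question are still in scope:

import numpy as np

# Rebuild the standardized frame from the scaler's fitted per-column mean and
# population standard deviation.
df_standardized_params = (df[cols] - std_scal.mean_) / std_scal.scale_

# Both should now match fit_transform up to floating-point rounding.
print(np.allclose(df_standardized_fit, df_standardized_params))                               # expected: True
print(np.allclose(df_standardized_fit, (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)))  # expected: True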
answered Oct 13 '22 by ayhan