I used scikit-learn's StandardScaler (mean removal and variance scaling) to scale a DataFrame and compared it to a DataFrame where I "manually" subtracted the mean and divided by the standard deviation. The comparison shows consistent small differences. Can anybody explain why? (The dataset I used is this: http://archive.ics.uci.edu/ml/datasets/Wine)
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("~/DataSets/WineDataSetItaly/wine.data.txt", names=["Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"])
cols = list(df.columns)[1:] # I didn't want to scale the "Class" column
std_scal = StandardScaler()
standardized = std_scal.fit_transform(df[cols])
df_standardized_fit = pd.DataFrame(standardized, index=df.index, columns=df.columns[1:])
df_standardized_manual = (df - df.mean()) / df.std()
df_standardized_manual.drop("Class", axis=1, inplace=True)
df_differences = df_standardized_fit - df_standardized_manual
df_differences.iloc[:,:5]
Alcohol Malic acid Ash Alcalinity of ash Magnesium
0 0.004272 -0.001582 0.000653 -0.003290 0.005384
1 0.000693 -0.001405 -0.002329 -0.007007 0.000051
2 0.000554 0.000060 0.003120 -0.000756 0.000249
3 0.004758 -0.000976 0.001373 -0.002276 0.002619
4 0.000832 0.000640 0.005177 0.001271 0.003606
5 0.004168 -0.001455 0.000858 -0.003628 0.002421
StandardScaler() standardizes each feature, i.e. each column of X, individually, so that every column ends up with mean μ = 0 and standard deviation σ = 1. The standard score (z-score) of each sample is computed as

z = (x − u) / s

where u is the mean of the training samples (or zero if with_mean=False) and s is the standard deviation of the training samples (or one if with_std=False). Since variance is the square of the standard deviation, the resulting distribution also has unit variance. In practice you create a StandardScaler() object and call fit_transform() on your data, exactly as in the question.
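A minimal sketch of that property, using a tiny made-up array: after fit_transform, the column has (population) mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy single-feature data standing in for one wine column
X = np.array([[1.0], [2.0], [3.0], [4.0]])

Z = StandardScaler().fit_transform(X)

print(Z.mean())  # ≈ 0
print(Z.std())   # ≈ 1 (note: population std, ddof=0)
```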
The differences you see come from which standard deviation is used. scikit-learn computes the population standard deviation, as np.std does by default (the sum of squared deviations divided by the number of observations N), while pandas computes the sample standard deviation (denominator N − 1; see Wikipedia's standard deviation article). The N − 1 denominator is Bessel's correction, which gives an unbiased estimate of the population variance; it is controlled by the degrees-of-freedom parameter ddof. So by default, numpy's and scikit-learn's calculations use ddof=0, whereas pandas uses ddof=1 (docs):
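The discrepancy is easy to see on a small made-up series: the two defaults disagree, and they agree again once ddof is matched.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])

# numpy default: population std, denominator N (ddof=0)
pop_std = np.std(x.to_numpy())   # ≈ 1.1180

# pandas default: sample std, denominator N - 1 (ddof=1)
sample_std = x.std()             # ≈ 1.2910

# they match once the same ddof is used
print(np.isclose(pop_std, x.std(ddof=0)))  # True
```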
DataFrame.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
If you change your pandas line to:
df_standardized_manual = (df - df.mean()) / df.std(ddof=0)
The differences will be practically zero:
Alcohol Malic acid Ash Alcalinity of ash Magnesium
0 -8.215650e-15 -5.551115e-16 3.191891e-15 0.000000e+00 2.220446e-16
1 -8.715251e-15 -4.996004e-16 3.441691e-15 0.000000e+00 0.000000e+00
2 -8.715251e-15 -3.955170e-16 2.886580e-15 -5.551115e-17 1.387779e-17
3 -8.437695e-15 -4.440892e-16 3.164136e-15 -1.110223e-16 1.110223e-16
4 -8.659740e-15 -3.330669e-16 2.886580e-15 5.551115e-17 2.220446e-16
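To confirm programmatically that the two approaches now agree (up to floating-point noise), you can compare them with np.allclose. A sketch on a small synthetic frame standing in for the wine features, since the original file path is local:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for df[cols]
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})

scaled = StandardScaler().fit_transform(df)
manual = (df - df.mean()) / df.std(ddof=0)  # ddof=0 matches scikit-learn

print(np.allclose(scaled, manual.to_numpy()))  # True
```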