I am using sklearn.preprocessing.StandardScaler to rescale my data, and I want to reproduce the same result with np.std.
However, I found something interesting: with no additional parameter passed, pandas.apply(fun = np.std) sometimes returns the sample std and sometimes the population std (see 2 Problem).
I know there is a parameter called ddof which controls the divisor used when calculating the variance. Without changing the default ddof = 0, how can I get different outputs?
First, I chose the iris dataset as an example. I scale the second column (index 1) of my data as follows.
from sklearn import datasets
import numpy as np
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X_train = iris.data[:,[1]] # X_train is the second column (index 1) of the iris data
sc = StandardScaler()
sc.fit(X_train) # Using StandardScaler to scale it!
ddof = 0
I got different output of np.std!
import pandas as pd
import sklearn
import sys
print("The mean and std(sample std) of X_train is :")
print(pd.DataFrame(X_train).apply([np.mean,np.std],axis = 0),"\n")
print("The std(population std) of X_train is :")
print(pd.DataFrame(X_train).apply(np.std,axis = 0),"\n")
print("The std(population std) of X_train is :","{0:.6f}".format(sc.scale_[0]),'\n')
print("Python version:",sys.version,
"\npandas version:",pd.__version__,
"\nsklearn version:",sklearn.__version__)
Out:
The mean and std(sample std) of X_train is :
0
mean 3.057333
std 0.435866
The std(population std) of X_train is :
0 0.434411
dtype: float64
The std(population std) of X_train is : 0.434411
Python version: 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
pandas version: 0.23.4
sklearn version: 0.20.1
From the above results, pd.DataFrame(X_train).apply([np.mean,np.std],axis = 0) gives the sample std 0.435866, while pd.DataFrame(X_train).apply(np.std,axis = 0) gives the population std 0.434411.
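The two numbers differ by exactly the factor sqrt(n / (n - 1)), which is what you would expect if one was computed with ddof=0 and the other with ddof=1. Here is a minimal sketch to check that, assuming n = 150 rows (the size of the iris data) and using only the two values printed above. The random data at the end is purely illustrative.

```python
import math

import numpy as np

n = 150                    # number of rows in the iris data (assumption)
population_std = 0.434411  # value printed by apply(np.std)
sample_std = 0.435866      # value printed by apply([np.mean, np.std])

# std with ddof=1 equals std with ddof=0 times sqrt(n / (n - 1))
predicted_sample_std = population_std * math.sqrt(n / (n - 1))
print(round(predicted_sample_std, 6), "vs", sample_std)

# The same identity on arbitrary data (random numbers, illustrative only)
rng = np.random.default_rng(0)
data = rng.normal(size=n)
identity_holds = np.isclose(
    np.std(data, ddof=1),
    np.std(data, ddof=0) * math.sqrt(n / (n - 1)),
)
print(identity_holds)
```

So the difference is not noise: the list form is evaluated with a different divisor than the plain form.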
Why does pandas.apply return different results?
How can I pass an additional parameter to np.std so that it gives the population std?
pd.DataFrame(X_train).apply(np.std,ddof = 1) accepts the extra parameter. But I am wondering how to do the same with pd.DataFrame(X_train).apply([np.mean,np.std],**args)
The reason for this behaviour can be found in the (perhaps inelegant) evaluation of .apply() on a Series. If you have a look at the source code, you'll find the following lines:
if isinstance(func, (list, dict)):
    return self.aggregate(func, *args, **kwds)
That means: if you call apply([func]), the results can differ from apply(func)!
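Which divisor the list form ends up using may depend on your pandas version, so rather than trusting a comment it can help to check it directly. The sketch below is only an illustration: classify_std is a hypothetical helper (not part of pandas or NumPy) that compares a result against np.std with ddof=0 and ddof=1, and the data is made up.

```python
import warnings

import numpy as np
import pandas as pd


def classify_std(value, data):
    """Report which ddof reproduces `value` for `data` (hypothetical helper)."""
    if np.isclose(value, np.std(data, ddof=0)):
        return "population (ddof=0)"
    if np.isclose(value, np.std(data, ddof=1)):
        return "sample (ddof=1)"
    return "neither"


values = [4.0, 5.0, 7.0]  # made-up data
df = pd.DataFrame({"A": values})

with warnings.catch_warnings():
    # Some pandas versions warn when a NumPy callable is passed in a list
    warnings.simplefilter("ignore")
    plain = df.apply(np.std).iloc[0]        # callable passed directly
    listed = df.apply([np.std]).iloc[0, 0]  # callable wrapped in a list

print("apply(np.std)   ->", classify_std(plain, values))
print("apply([np.std]) ->", classify_std(listed, values))
```

On the versions shown in the question, the second line should report the sample std, because the list form is routed through aggregate.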
With regard to np.std, I'd suggest using the builtin df.std() method or perhaps df.describe().
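If you want the mean and the std in one table, as in the question, one option is to build that table from the builtin methods yourself, so that ddof is always explicit. This is a sketch on a small made-up frame, and the row labels std_population and std_sample are my own naming, not something pandas produces.

```python
import pandas as pd

df = pd.DataFrame([[4, 5], [5, 6]], columns=["A", "B"])

# Build the summary explicitly so the divisor is under your control
summary = pd.DataFrame({
    "mean": df.mean(),
    "std_population": df.std(ddof=0),  # divisor N
    "std_sample": df.std(ddof=1),      # divisor N - 1 (the pandas default)
}).T

print(summary)
```

This avoids the question of how apply([...]) forwards keyword arguments altogether.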
You can try out the following code in order to understand what works and what doesn't:
import numpy as np
import pandas as pd
print(10*"-","Showing ddof impact",10*"-")
print(np.std([4,5], ddof=0)) # 0.5 ## N (population's standard deviation)
print(np.std([4,5], ddof=1)) # 0.707... # N-1 (unbiased sample variance)
x = pd.Series([4,5])
print(10*"-","calling builtin .std() on Series",10*"-")
print(x.std(ddof=0)) # 0.5
print(x.std()) # 0.707
df=pd.DataFrame([[4,5],[5,6]], columns=['A', 'B'])
print(10*"-","calling builtin .std() on DF",10*"-")
print(df["A"].std(ddof=0))# 0.5
print(df["B"].std(ddof=0))# 0.5
print(df["A"].std())# 0.707
print(df["B"].std())# 0.707
print(10*"-","applying np.std to whole DF",10*"-")
print(df.apply(np.std,ddof=0)) # A = 0.5, B = 0.5
print(df.apply(np.std,ddof=1)) # A = 0.707 B = 0.707
# print(10*"-","applying [np.std] to whole DF WONT work",10*"-")
# print(df.apply([np.std],axis=0,ddof=0)) ## this WONT Work
# print(df.apply([np.std],axis=0,ddof=1)) ## this WONT Work
print(10*"-","applying [np.std] to DF columns",10*"-")
print(df["A"].apply([np.std])) # 0.707
print(df["A"].apply([np.std],ddof=1)) # 0.707
print(10*"-","applying np.std to DF columns",10*"-")
print(df["A"].apply(np.std)) # 0: 0 1: 0 WHOOPS !! #<---------------------
print(30*"-")
You can also figure out what's happening by applying your own function:
def myFun(a):
    print(type(a))
    return np.std(a,ddof=0)
print("> 0",20*"-")
print(x.apply(myFun))
print("> 1",20*"-","## <- only this will be applied to the Series!")
print(df.apply(myFun))
print("> 2",20*"-","## <- this will be applied to each Int!")
print(df.apply([myFun]))
print("> 3",20*"-")
print(df["A"].apply(myFun))
print("> 4",20*"-")
print(df["A"].apply([myFun]))