
Inconsistent output from Pandas apply function with np.std as function parameter

I am using sklearn.preprocessing.StandardScaler to re-scale my data, and I want to use np.std to do the same thing as StandardScaler.

However, I found something interesting: with no additional parameter passed to pandas.apply(fun = np.std), the output varies between the sample std and the population std. (See 2 Problem)

I know there is a parameter called ddof which controls the divisor when calculating the sample variance. Without changing the default parameter ddof = 0, how can I get different outputs?

1 Dataset:

First, I use the iris dataset as an example. I scale one column of the data (index 1, i.e. the second column) as follows.

from sklearn import datasets
import numpy as np
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X_train = iris.data[:,[1]] # X_train is column index 1 (the second column) of the iris data
sc = StandardScaler() 
sc.fit(X_train) # Using StandardScaler to scale it!
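
As a rough sketch of what the question is after, standardizing by hand with np.std could look like the following. The array X is made-up stand-in data (not the iris column), and the variable names are illustrative. The choice of ddof=0 is based on the output further down, where sc.scale_ agrees with the population std.

```python
import numpy as np

# Hypothetical stand-in for one feature column (shape: n_samples x 1)
X = np.array([[3.5], [3.0], [3.2], [3.1], [3.6]])

mean = X.mean(axis=0)
std_pop = X.std(axis=0, ddof=0)     # population std, divisor N
std_sample = X.std(axis=0, ddof=1)  # sample std, divisor N - 1

# Manual standardization using the population std
X_scaled = (X - mean) / std_pop

print("population std:", std_pop)
print("sample std:    ", std_sample)
print("scaled column has mean ~0 and population std ~1:",
      X_scaled.mean(axis=0), X_scaled.std(axis=0, ddof=0))
```

Passing ddof explicitly on every call keeps the result independent of whichever default a given library or code path happens to use.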

2 Problem: without changing the default ddof = 0, I get different outputs from np.std!

import pandas as pd
import sys
import sklearn  # needed for sklearn.__version__ below
print("The mean and std(sample std) of X_train is :")
print(pd.DataFrame(X_train).apply([np.mean,np.std],axis = 0),"\n")

print("The std(population std) of X_train is :")
print(pd.DataFrame(X_train).apply(np.std,axis = 0),"\n") 

print("The std(population std) of X_train is :","{0:.6f}".format(sc.scale_[0]),'\n') 

print("Python version:",sys.version,
      "\npandas version:",pd.__version__,
      "\nsklearn version:",sklearn.__version__)

Out:

The mean and std(sample std) of X_train is :
             0
mean  3.057333
std   0.435866 

The std(population std) of X_train is :
0    0.434411
dtype: float64 

The std(population std) of X_train is : 0.434411 

Python version: 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)] 
pandas version: 0.23.4 
sklearn version: 0.20.1

From the above results, pd.DataFrame(X_train).apply([np.mean,np.std],axis = 0) gives the sample std 0.435866, while pd.DataFrame(X_train).apply(np.std,axis = 0) gives the population std 0.434411.
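
The two numbers differ only by the divisor (N versus N - 1), so one can be converted into the other. A small arithmetic check, assuming the iris dataset has 150 samples (a value not stated in the post itself):

```python
import math

n = 150             # assumption: number of samples in the iris dataset
pop_std = 0.434411  # ddof=0 value printed in the output above

# Switching the divisor from N to N - 1 scales the std by sqrt(N / (N - 1))
sample_std = pop_std * math.sqrt(n / (n - 1))
print(f"{sample_std:.6f}")  # 0.435866
```

This agrees with the value reported by the list form of apply, which suggests that code path ends up using ddof = 1.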

3 My questions:

  1. Why does pandas.apply return different results?

  2. How can I pass an additional parameter to np.std so that it gives the population std?

pd.DataFrame(X_train).apply(np.std,ddof = 1) accepts the extra argument. But I am wondering how to do the same with pd.DataFrame(X_train).apply([np.mean,np.std],**args).

Travis asked Nov 07 '22 18:11


1 Answer

The reason for this behaviour can be found in the (perhaps inelegant) evaluation of .apply() on a Series. If you have a look at the source code, you'll find the following lines:

if isinstance(func, (list, dict)):
    return self.aggregate(func, *args, **kwds)

That means: if you call apply([func]), the call is forwarded to aggregate(), so the results can differ from apply(func)! With regard to np.std, I'd suggest using the built-in df.std() method or perhaps df.describe().
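
A minimal sketch of that suggestion: build the mean/std summary from the built-in reductions and pass ddof explicitly, instead of handing a list of functions to apply(). The DataFrame and the row labels (std_population, std_sample) are illustrative choices, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({"A": [4.0, 5.0], "B": [5.0, 6.0]})

# Each reduction returns one value per column; ddof is stated explicitly
summary = pd.DataFrame({
    "mean": df.mean(),
    "std_population": df.std(ddof=0),  # divisor N
    "std_sample": df.std(ddof=1),      # divisor N - 1 (the pandas default)
}).T

print(summary)
```

Because every row comes from a direct method call, the result does not depend on how apply() or aggregate() dispatches a list of functions in a particular pandas version.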

You can try out the following code in order to understand what works and what doesn't:

import numpy as np
import pandas as pd

print(10*"-","Showing ddof impact",10*"-")

print(np.std([4,5], ddof=0)) # 0.5      # divisor N   (population standard deviation)
print(np.std([4,5], ddof=1)) # 0.707... # divisor N-1 (sample standard deviation)

x = pd.Series([4,5])

print(10*"-","calling builtin .std() on Series",10*"-")
print(x.std(ddof=0)) # 0.5
print(x.std()) # 0.707

df=pd.DataFrame([[4,5],[5,6]], columns=['A', 'B'])

print(10*"-","calling builtin .std() on DF",10*"-")

print(df["A"].std(ddof=0))# 0.5
print(df["B"].std(ddof=0))# 0.5
print(df["A"].std())# 0.707
print(df["B"].std())# 0.707

print(10*"-","applying np.std to whole DF",10*"-")
print(df.apply(np.std,ddof=0)) # A = 0.5,  B = 0.5
print(df.apply(np.std,ddof=1)) # A = 0.707 B = 0.707

# print(10*"-","applying [np.std] to whole DF WONT work",10*"-")
# print(df.apply([np.std],axis=0,ddof=0)) ## this WONT Work
# print(df.apply([np.std],axis=0,ddof=1)) ## this WONT Work

print(10*"-","applying [np.std] to DF columns",10*"-")
print(df["A"].apply([np.std])) # 0.707
print(df["A"].apply([np.std],ddof=1)) # 0.707

print(10*"-","applying np.std to DF columns",10*"-")
print(df["A"].apply(np.std)) # 0: 0 1: 0 WHOOPS !! #<---------------------
print(30*"-")

You can also figure out what's happening by applying your own function:

def myFun(a):
    print(type(a))
    return np.std(a,ddof=0)

print("> 0",20*"-")    
print(x.apply(myFun))
print("> 1",20*"-","## <- only this will be applied to the Series!")
print(df.apply(myFun))
print("> 2",20*"-","## <- this will be applied to each Int!")
print(df.apply([myFun]))
print("> 3",20*"-")
print(df["A"].apply(myFun))
print("> 4",20*"-")
print(df["A"].apply([myFun]))

Asmus answered Nov 14 '22 23:11