I was wondering how to calculate skewness and kurtosis correctly in pandas. Pandas gives some values for skew() and kurtosis(), but they seem quite different from the scipy.stats values. Which one should I trust, pandas or scipy.stats?
Here is my code:
import numpy as np
import scipy.stats as stats
import pandas as pd
np.random.seed(100)
x = np.random.normal(size=(20))
kurtosis_scipy = stats.kurtosis(x)
kurtosis_pandas = pd.DataFrame(x).kurtosis()[0]
print(kurtosis_scipy, kurtosis_pandas)
# -0.5270409758168872
# -0.31467107631025604
skew_scipy = stats.skew(x)
skew_pandas = pd.DataFrame(x).skew()[0]
print(skew_scipy, skew_pandas)
# -0.41070929017558555
# -0.44478877631598901
Versions:
import scipy  # needed here: "import scipy.stats as stats" does not bind the name scipy
print(np.__version__, pd.__version__, scipy.__version__)
# 1.11.0 0.20.0 0.19.0
A pandas DataFrame has a kurtosis() method that computes the kurtosis of the values along a given axis (i.e., per column or per row). For example, in an analysis of a Birthweight column, the skew is -0.1. As a rule of thumb, if the absolute value of the skew is below 0.5, the distribution is approximately symmetric.
To calculate the sample skewness and sample kurtosis of a dataset, we can use the skew() and kurtosis() functions from the scipy.stats library with the following syntax: skew(array of values, bias=False) and kurtosis(array of values, bias=False).
Pandas DataFrame skew() method: the skew() method calculates the skew of each column. By specifying the column axis (axis='columns'), the skew() method works across the columns and returns the skew of each row.
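The axis behaviour described above can be checked with a small sketch. The DataFrame and its column names a, b, c are made up for illustration; each row and column holds three values, the minimum pandas needs to return a skew instead of NaN.

```python
import pandas as pd

# Hypothetical data: 3 rows x 3 columns, no constant row or column.
df = pd.DataFrame({
    "a": [1.0, 2.0, 4.0],
    "b": [2.0, 2.0, 5.0],
    "c": [6.0, 9.0, 6.0],
})

per_column = df.skew()                # default axis: one skew value per column
per_row = df.skew(axis="columns")     # across the columns: one skew value per row

print(per_column)
print(per_row)
```

The first result is indexed by column name, the second by row label.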
For parts (c) and (d), recall that X=a+(b−a)U where U has the uniform distribution on [0,1] (the standard uniform distribution). Hence it follows from the formulas for skewness and kurtosis under linear transformations that skew(X)=skew(U) and kurt(X)=kurt(U).
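A quick numerical check of this invariance, as a sketch only: the endpoints a and b and the seed are arbitrary choices, and the result relies on b > a (a negative scale factor would flip the sign of the skew).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
u = rng.uniform(size=10_000)   # samples of U on [0, 1]

a, b = 2.0, 7.0                # hypothetical endpoints, with b > a
x = a + (b - a) * u            # X = a + (b - a) U

skew_u, skew_x = stats.skew(u), stats.skew(x)
kurt_u, kurt_x = stats.kurtosis(u), stats.kurtosis(x)  # excess kurtosis by default

print(skew_u, skew_x)
print(kurt_u, kurt_x)
```

Both pairs should agree up to floating-point rounding, since shifting and positively rescaling the data leaves the standardized moments unchanged.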
Set bias=False in the scipy calls:
print(
    stats.kurtosis(x, bias=False), pd.DataFrame(x).kurtosis()[0],
    stats.skew(x, bias=False), pd.DataFrame(x).skew()[0],
    sep='\n'
)
-0.31467107631025515
-0.31467107631025604
-0.4447887763159889
-0.444788776315989
Pandas calculates the unbiased (bias-corrected) estimator of the population excess kurtosis, whereas scipy.stats uses the biased estimator by default (bias=True). See Wikipedia for the formulas: https://www.wikiwand.com/en/Kurtosis
import numpy as np
import pandas as pd
import scipy.stats  # plain "import scipy" may not load the stats submodule
x = np.array([0, 3, 4, 1, 2, 3, 0, 2, 1, 3, 2, 0,
              2, 2, 3, 2, 5, 2, 3, 999])
xbar = np.mean(x)
n = x.size
k2 = x.var(ddof=1) # default numpy is biased, ddof = 0
sum_term = ((x-xbar)**4).sum()
factor = (n+1) * n / (n-1) / (n-2) / (n-3)
second = - 3 * (n-1) * (n-1) / (n-2) / (n-3)
first = factor * sum_term / k2 / k2
G2 = first + second
G2 # 19.998428728659768
scipy.stats.kurtosis(x,bias=False) # 19.998428728659757
pd.DataFrame(x).kurtosis() # 19.998429
Similarly, you can also calculate skewness.
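A minimal sketch of the skewness counterpart, assuming the bias-corrected formula G1 = n / ((n-1)(n-2)) * sum((x - xbar)^3) / s^3, where s is the sample standard deviation with ddof=1. It uses the same array as the kurtosis example.

```python
import numpy as np
import pandas as pd
from scipy import stats

x = np.array([0, 3, 4, 1, 2, 3, 0, 2, 1, 3, 2, 0,
              2, 2, 3, 2, 5, 2, 3, 999])

n = x.size
xbar = x.mean()
s = x.std(ddof=1)                      # sample standard deviation (unbiased variance)
sum_cubed = ((x - xbar) ** 3).sum()    # sum of cubed deviations from the mean

G1 = n / ((n - 1) * (n - 2)) * sum_cubed / s ** 3

print(G1)
print(stats.skew(x, bias=False))
print(pd.Series(x).skew())
```

All three printed values should agree up to floating-point rounding.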