Skewness is a parameter that measures the asymmetry of a data set, and kurtosis measures how heavy its tails are compared to a normal distribution.
scipy.stats provides an easy way to calculate these two quantities; see scipy.stats.kurtosis and scipy.stats.skew.
In my understanding, the skewness and kurtosis of a normal distribution should both be 0 using the functions just mentioned. That is, however, not the case with my code:
import numpy as np
from scipy.stats import kurtosis
from scipy.stats import skew

x = np.linspace(-5, 5, 1000)
y = 1./(np.sqrt(2.*np.pi)) * np.exp(-.5*x**2)  # normal distribution

print('excess kurtosis of normal distribution (should be 0): {}'.format(kurtosis(y)))
print('skewness of normal distribution (should be 0): {}'.format(skew(y)))
The output is:
excess kurtosis of normal distribution (should be 0): -0.307393087742
skewness of normal distribution (should be 0): 1.11082371392
What am I doing wrong?
The versions I am using are:
python: 2.7.6
scipy : 0.17.1
numpy : 1.12.1
A general guideline for skewness is that a value greater than +1 or lower than -1 indicates a substantially skewed distribution. For (excess) kurtosis, the general guideline is that a value greater than +1 indicates the distribution is too peaked, and a value lower than -1 indicates it is too flat.
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
If the kurtosis is greater than 3, then the dataset has heavier tails than a normal distribution (more in the tails). If the kurtosis is less than 3, then the dataset has lighter tails than a normal distribution (less in the tails).
“Skewness essentially measures the symmetry of the distribution, while kurtosis determines the heaviness of the distribution tails.” Understanding the shape of the data is a crucial step: it helps to see where most of the information lies and to analyze the outliers in a given data set.
scipy.stats.kurtosis(array, axis=0, fisher=True, bias=True) calculates the kurtosis (Fisher or Pearson) of a data set: the fourth central moment divided by the square of the variance.
As far as I can tell from the documentation, both kurtosis functions compute using Fisher's definition, whereas for skew there doesn't seem to be enough of a description to tell if there are any major differences in how they are computed.
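As a sanity check of the two definitions, the following sketch (with a seeded NumPy generator, so the numbers are reproducible but otherwise arbitrary) shows that scipy's fisher=True and fisher=False results differ by exactly 3:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 100_000)

# Fisher's definition (the default): excess kurtosis, ~0 for a normal sample
k_fisher = kurtosis(x, fisher=True)
# Pearson's definition: plain kurtosis, ~3 for a normal sample
k_pearson = kurtosis(x, fisher=False)

print(k_fisher, k_pearson)
# The two definitions differ by the constant 3
print(k_pearson - k_fisher)
```

So the choice of fisher only shifts the result by 3; it does not change how the moments themselves are estimated.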
The scipy documentation also suggests using kurtosistest to see whether a result is close enough to normal. kurtosis takes the data for which the kurtosis is calculated; if axis is an int, the statistic is computed along that axis of the input, and the statistic of each axis-slice (e.g. row) appears in a corresponding element of the output.
The kurtosis of a normal distribution is 3. If a given distribution has a kurtosis less than 3, it is said to be platykurtic, which means it tends to produce fewer and less extreme outliers than the normal distribution.
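To illustrate the terminology, here is a small sketch (the particular distributions are illustrative choices, not from the question) comparing a platykurtic uniform sample with a heavier-tailed Student's t sample:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(42)
u = rng.uniform(-1, 1, 100_000)        # uniform: Pearson kurtosis is exactly 1.8
t = rng.standard_t(df=5, size=100_000)  # Student's t (df=5): heavy tails

k_u = kurtosis(u, fisher=False)  # < 3: platykurtic, fewer extreme outliers
k_t = kurtosis(t, fisher=False)  # > 3: leptokurtic, more extreme outliers
print(k_u, k_t)
```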
These functions calculate moments of the probability distribution (that's why they take only one parameter) and don't care about the "functional form" of the values.
These are meant for "random datasets" (think of them as measures like mean, standard deviation, variance):
import numpy as np
from scipy.stats import kurtosis, skew

x = np.random.normal(0, 2, 10000)  # create random values based on a normal distribution

print('excess kurtosis of normal distribution (should be 0): {}'.format(kurtosis(x)))
print('skewness of normal distribution (should be 0): {}'.format(skew(x)))
which gives:
excess kurtosis of normal distribution (should be 0): -0.024291887786943356
skewness of normal distribution (should be 0): 0.009666157036010928
Increasing the number of random values improves the accuracy:
x = np.random.normal(0, 2, 10000000)
Leading to:
excess kurtosis of normal distribution (should be 0): -0.00010309478605163847
skewness of normal distribution (should be 0): -0.0006751744848755031
In your case the functions "assume" that each value has the same "probability" (because the values are equally spaced and each value occurs only once), so from the point of view of skew and kurtosis they are dealing with a non-Gaussian probability density (the distribution of the density heights themselves), which explains why the resulting values aren't even close to 0:
import numpy as np
from scipy.stats import kurtosis, skew
import matplotlib.pyplot as plt

x_random = np.random.normal(0, 2, 10000)
x = np.linspace(-5, 5, 10000)
y = 1./(np.sqrt(2.*np.pi)) * np.exp(-.5*x**2)  # normal distribution

f, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(x_random, bins='auto')
ax1.set_title('probability density (random)')
ax2.hist(y, bins='auto')
ax2.set_title('(your dataset)')
plt.tight_layout()
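If the goal really is the moments of the density curve itself rather than of a sample, one workaround (a sketch, not part of the original answer) is to treat the grid values y as weights when forming the central moments; this recovers values close to 0 for the question's setup:

```python
import numpy as np

x = np.linspace(-5, 5, 1000)
y = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)  # N(0, 1) density on a grid

# Treat y as weights on the grid points, not as samples
mean = np.average(x, weights=y)
var = np.average((x - mean) ** 2, weights=y)
skew_w = np.average((x - mean) ** 3, weights=y) / var ** 1.5
kurt_w = np.average((x - mean) ** 4, weights=y) / var ** 2 - 3  # excess kurtosis

print(skew_w, kurt_w)  # both approximately 0
```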
You are using as data the "shape" of the density function. These functions are meant to be used with data sampled from a distribution. If you sample from the distribution, you will obtain sample statistics that will approach the correct value as you increase the sample size. To plot the data, I would recommend a histogram.
%matplotlib inline
import numpy as np
import pandas as pd
from scipy.stats import kurtosis
from scipy.stats import skew
import matplotlib.pyplot as plt

plt.style.use('ggplot')
data = np.random.normal(0, 1, 10000000)
np.var(data)
plt.hist(data, bins=60)
print("mean : ", np.mean(data))
print("var  : ", np.var(data))
print("skew : ", skew(data))
print("kurt : ", kurtosis(data))
Output:
mean : 0.000410213500847
var  : 0.999827716979
skew : 0.00012294118186476907
kurt : 0.0033554829466604374
Unless you are dealing with an analytical expression, it is extremely unlikely that you will obtain a zero when using data.
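For the analytical values themselves, scipy's distribution objects can return the exact skewness and excess kurtosis directly, e.g. via norm.stats:

```python
from scipy.stats import norm

# Exact (analytical) skewness and excess kurtosis of the normal distribution
s, k = norm.stats(moments='sk')
print(s, k)  # both are exactly 0
```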