I have a data set on N numbers that I want to test for normality. I know scipy.stats has a kstest function but there are no examples on how to use it and how to interpret the results. Is anyone here familiar with it that can give me some advice?
According to the documentation, using kstest returns two numbers, the KS test statistic D and the p-value. If the p-value is greater than the significance level (say 5%), then we cannot reject the hypothesis that the data come from the given distribution.
When I do a test run by drawing 10000 samples from a normal distribution and testing for gaussianity:
import numpy as np from scipy.stats import kstest mu,sigma = 0.07, 0.89 kstest(np.random.normal(mu,sigma,10000),'norm')
I get the following output:
(0.04957880905196102, 8.9249710700788814e-22)
The p-value is less than 5% which means that we can reject the hypothesis that the data are normally distributed. But the samples were drawn from a normal distribution!
Can someone understand and explain to me the discrepancy here?
(Does testing for normality assume mu = 0 and sigma = 1? If so, how can I test that my data are gaussianly distributed but with a different mu and sigma?)
The Kolmogorov-Smirnov test is used to test whether or not or not a sample comes from a certain distribution. To perform a Kolmogorov-Smirnov test in Python we can use the scipy. stats. kstest() for a one-sample test or scipy.
Two ways to measure KS Statistic One is dependent variable which should be binary. Second one is predicted probability score which is generated from statistical model. Create deciles based on predicted probability columns which means dividing probability into 10 parts.
scipy.stats.ks_2samp(data1, data2)[source] Computes the Kolmogorov-Smirnov statistic on 2 samples. This is a two-sided test for the null hypothesis that 2 independent samples are drawn from the same continuous distribution.
Your data was generated with mu=0.07 and sigma=0.89. You are testing this data against a normal distribution with mean 0 and standard deviation of 1.
The null hypothesis (H0
) is that the distribution of which your data is a sample is equal to the standard normal distribution with mean 0, std deviation 1.
The small p-value is indicating that a test statistic as large as D would be expected with probability p-value.
In other words, (with p-value ~8.9e-22) it is highly unlikely that H0
is true.
That is reasonable, since the means and std deviations don't match.
Compare your result with:
In [22]: import numpy as np In [23]: import scipy.stats as stats In [24]: stats.kstest(np.random.normal(0,1,10000),'norm') Out[24]: (0.007038739782416259, 0.70477679457831155)
To test your data is gaussian, you could shift and rescale it so it is normal with mean 0 and std deviation 1:
data=np.random.normal(mu,sigma,10000) normed_data=(data-mu)/sigma print(stats.kstest(normed_data,'norm')) # (0.0085805670733036798, 0.45316245879609179)
Warning: (many thanks to user333700 (aka scipy developer Josef Perktold)) If you don't know mu
and sigma
, estimating the parameters makes the p-value invalid:
import numpy as np import scipy.stats as stats mu = 0.3 sigma = 5 num_tests = 10**5 num_rejects = 0 alpha = 0.05 for i in xrange(num_tests): data = np.random.normal(mu, sigma, 10000) # normed_data = (data - mu) / sigma # this is okay # 4915/100000 = 0.05 rejects at rejection level 0.05 (as expected) normed_data = (data - data.mean()) / data.std() # this is NOT okay # 20/100000 = 0.00 rejects at rejection level 0.05 (not expected) D, pval = stats.kstest(normed_data, 'norm') if pval < alpha: num_rejects += 1 ratio = float(num_rejects) / num_tests print('{}/{} = {:.2f} rejects at rejection level {}'.format( num_rejects, num_tests, ratio, alpha))
prints
20/100000 = 0.00 rejects at rejection level 0.05 (not expected)
which shows that stats.kstest
may not reject the expected number of null hypotheses if the sample is normalized using the sample's mean and standard deviation
normed_data = (data - data.mean()) / data.std() # this is NOT okay
An update on unutbu's answer:
For distributions that only depend on the location and scale but do not have a shape parameter, the distributions of several goodness-of-fit test statistics are independent of the location and scale values. The distribution is non-standard, however, it can be tabulated and used with any location and scale of the underlying distribution.
The Kolmogorov-Smirnov test for the normal distribution with estimated location and scale is also called the Lilliefors test.
It is now available in statsmodels, with approximate p-values for the relevant decision range.
>>> import numpy as np >>> mu,sigma = 0.07, 0.89 >>> x = np.random.normal(mu, sigma, 10000) >>> import statsmodels.api as sm >>> sm.stats.lilliefors(x) (0.0055267411213540951, 0.66190841161592895)
Most Monte Carlo studies show that the Anderson-Darling test is more powerful than the Kolmogorov-Smirnov test. It is available in scipy.stats with critical values, and in statsmodels with approximate p-values:
>>> sm.stats.normal_ad(x) (0.23016468240712129, 0.80657628536145665)
Neither of the test rejects the Null hypothesis that the sample is normal distributed. While the kstest in the question rejects the Null hypothesis that the sample is standard normal distributed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With