
Implementing a Kolmogorov Smirnov test in python scipy

I have a data set of N numbers that I want to test for normality. I know scipy.stats has a kstest function, but there are no examples of how to use it and how to interpret the results. Is anyone here familiar with it who can give me some advice?

According to the documentation, using kstest returns two numbers, the KS test statistic D and the p-value. If the p-value is greater than the significance level (say 5%), then we cannot reject the hypothesis that the data come from the given distribution.
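For example, this is how I read the decision rule (the 5% level and the variable names here are my own choices):

import numpy as np
from scipy.stats import kstest

# sketch of the decision rule as I understand it
D, p_value = kstest(np.random.normal(0, 1, 1000), 'norm')
alpha = 0.05  # significance level, chosen arbitrarily here
if p_value > alpha:
    print('cannot reject H0: data come from the given distribution')
else:
    print('reject H0')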

When I do a test run by drawing 10000 samples from a normal distribution and testing for gaussianity:

import numpy as np
from scipy.stats import kstest

mu, sigma = 0.07, 0.89
kstest(np.random.normal(mu, sigma, 10000), 'norm')

I get the following output:

(0.04957880905196102, 8.9249710700788814e-22)

The p-value is less than 5%, which means that we can reject the hypothesis that the data are normally distributed. But the samples were drawn from a normal distribution!

Can someone explain this discrepancy to me?

(Does testing for normality assume mu = 0 and sigma = 1? If so, how can I test that my data are Gaussian but with a different mu and sigma?)

asked Oct 26 '11 by Hooloovoo


2 Answers

Your data were generated with mu=0.07 and sigma=0.89, but you are testing them against a normal distribution with mean 0 and standard deviation 1.

The null hypothesis (H0) is that the distribution your data are sampled from is the standard normal distribution, with mean 0 and standard deviation 1.

The small p-value says that, if H0 were true, a test statistic as large as D would be observed with probability equal to the p-value.

In other words (with p-value ~8.9e-22), it is highly unlikely that H0 is true.

That is reasonable, since the means and std deviations don't match.

Compare your result with:

In [22]: import numpy as np
In [23]: import scipy.stats as stats
In [24]: stats.kstest(np.random.normal(0,1,10000),'norm')
Out[24]: (0.007038739782416259, 0.70477679457831155)

To test whether your data are Gaussian, you could shift and rescale them so they are normal with mean 0 and standard deviation 1:

data = np.random.normal(mu, sigma, 10000)
normed_data = (data - mu) / sigma
print(stats.kstest(normed_data, 'norm'))
# (0.0085805670733036798, 0.45316245879609179)
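Equivalently, if you know mu and sigma, you can pass them to kstest through its args parameter, which forwards them to the norm CDF as loc and scale; this should give the same kind of result as normalizing by hand:

data = np.random.normal(mu, sigma, 10000)
# args=(loc, scale) is passed through to the 'norm' CDF, so this tests
# against N(mu, sigma) directly instead of rescaling the data first
print(stats.kstest(data, 'norm', args=(mu, sigma)))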

Warning (many thanks to user333700, aka scipy developer Josef Perktold): if you don't know mu and sigma, estimating the parameters from the data makes the p-value invalid:

import numpy as np
import scipy.stats as stats

mu = 0.3
sigma = 5

num_tests = 10**5
num_rejects = 0
alpha = 0.05
for i in range(num_tests):
    data = np.random.normal(mu, sigma, 10000)
    # normed_data = (data - mu) / sigma    # this is okay
    # 4915/100000 = 0.05 rejects at rejection level 0.05 (as expected)
    normed_data = (data - data.mean()) / data.std()    # this is NOT okay
    # 20/100000 = 0.00 rejects at rejection level 0.05 (not expected)
    D, pval = stats.kstest(normed_data, 'norm')
    if pval < alpha:
        num_rejects += 1
ratio = float(num_rejects) / num_tests
print('{}/{} = {:.2f} rejects at rejection level {}'.format(
    num_rejects, num_tests, ratio, alpha))

prints

20/100000 = 0.00 rejects at rejection level 0.05 (not expected) 

which shows that stats.kstest may not reject the expected number of null hypotheses if the sample is normalized using the sample's own mean and standard deviation:

normed_data = (data - data.mean()) / data.std()    # this is NOT okay 
answered Sep 27 '22 by unutbu


An update on unutbu's answer:

For distributions that depend only on location and scale and have no shape parameter, the distributions of several goodness-of-fit test statistics are independent of the location and scale values. The distribution is non-standard; however, it can be tabulated and used with any location and scale of the underlying distribution.
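For illustration, here is a rough Monte Carlo sketch of that tabulation idea (the helper name and the number of replications are arbitrary choices):

import numpy as np
import scipy.stats as stats

def ks_mc_pvalue(x, num_sims=1000, seed=12345):
    # KS statistic for x, with location and scale estimated from x itself
    d_obs, _ = stats.kstest((x - x.mean()) / x.std(), 'norm')
    rng = np.random.default_rng(seed)
    n = len(x)
    d_null = np.empty(num_sims)
    for i in range(num_sims):
        sim = rng.standard_normal(n)
        # estimate location and scale from each simulated sample as well,
        # so the null distribution reflects the estimation step
        d_null[i], _ = stats.kstest((sim - sim.mean()) / sim.std(), 'norm')
    # Monte Carlo p-value: fraction of null statistics >= the observed one
    return (d_null >= d_obs).mean()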

The Kolmogorov-Smirnov test for the normal distribution with estimated location and scale is also called the Lilliefors test.

It is now available in statsmodels, with approximate p-values for the relevant decision range.

>>> import numpy as np
>>> mu, sigma = 0.07, 0.89
>>> x = np.random.normal(mu, sigma, 10000)
>>> import statsmodels.api as sm
>>> sm.stats.lilliefors(x)
(0.0055267411213540951, 0.66190841161592895)

Most Monte Carlo studies show that the Anderson-Darling test is more powerful than the Kolmogorov-Smirnov test. It is available in scipy.stats with critical values, and in statsmodels with approximate p-values:

>>> sm.stats.normal_ad(x)
(0.23016468240712129, 0.80657628536145665)
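The scipy variant returns the statistic together with critical values instead of a p-value. Roughly, assuming a recent scipy (older versions return a plain tuple rather than a result object):

from scipy import stats

# x as generated above; anderson returns the statistic plus critical values
res = stats.anderson(x, dist='norm')
print(res.statistic)           # the A^2 statistic
print(res.critical_values)     # critical values...
print(res.significance_level)  # ...at matching significance levels (in %)
# reject normality at, say, the 5% level if res.statistic exceeds the
# critical value paired with significance_level == 5.0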

Neither of these tests rejects the null hypothesis that the sample is normally distributed, while the kstest in the question rejects the null hypothesis that the sample is standard normal distributed.

answered Sep 27 '22 by Josef