
Implementing a Kolmogorov Smirnov test in python scipy

I have a data set of N numbers that I want to test for normality. I know scipy.stats has a kstest function, but there are no examples of how to use it and how to interpret the results. Is anyone here familiar with it who can give me some advice?

According to the documentation, using kstest returns two numbers, the KS test statistic D and the p-value. If the p-value is greater than the significance level (say 5%), then we cannot reject the hypothesis that the data come from the given distribution.
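For example, this is how I read the decision rule (the 5% level and the variable names here are my own choices):

import numpy as np
from scipy.stats import kstest

# sketch of the decision rule as I understand it
D, p_value = kstest(np.random.normal(0, 1, 1000), 'norm')
alpha = 0.05  # significance level, chosen arbitrarily here
if p_value > alpha:
    print('cannot reject H0: data come from the given distribution')
else:
    print('reject H0')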

When I do a test run by drawing 10000 samples from a normal distribution and testing for gaussianity:

import numpy as np
from scipy.stats import kstest

mu, sigma = 0.07, 0.89
kstest(np.random.normal(mu, sigma, 10000), 'norm')

I get the following output:

(0.04957880905196102, 8.9249710700788814e-22)

The p-value is less than 5%, which means that we can reject the hypothesis that the data are normally distributed. But the samples were drawn from a normal distribution!

Can someone explain this discrepancy to me?

(Does testing for normality assume mu = 0 and sigma = 1? If so, how can I test that my data are Gaussian but with a different mu and sigma?)

asked Oct 26 '11 by Hooloovoo


2 Answers

Your data were generated with mu=0.07 and sigma=0.89, but you are testing them against a normal distribution with mean 0 and standard deviation 1.

The null hypothesis (H0) is that the distribution your data are sampled from is the standard normal distribution, with mean 0 and standard deviation 1.

The small p-value says that, if H0 were true, a test statistic as large as D would be observed with probability equal to the p-value.

In other words (with p-value ~8.9e-22), it is highly unlikely that H0 is true.

That is reasonable, since the means and std deviations don't match.

Compare your result with:

In [22]: import numpy as np
In [23]: import scipy.stats as stats
In [24]: stats.kstest(np.random.normal(0,1,10000),'norm')
Out[24]: (0.007038739782416259, 0.70477679457831155)

To test whether your data are Gaussian, you could shift and rescale them so they are normal with mean 0 and standard deviation 1:

data = np.random.normal(mu, sigma, 10000)
normed_data = (data - mu) / sigma
print(stats.kstest(normed_data, 'norm'))
# (0.0085805670733036798, 0.45316245879609179)
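Equivalently, if you know mu and sigma, you can pass them to kstest through its args parameter, which forwards them to the norm CDF as loc and scale; this should give the same kind of result as normalizing by hand:

data = np.random.normal(mu, sigma, 10000)
# args=(loc, scale) is passed through to the 'norm' CDF, so this tests
# against N(mu, sigma) directly instead of rescaling the data first
print(stats.kstest(data, 'norm', args=(mu, sigma)))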

Warning (many thanks to user333700, aka scipy developer Josef Perktold): if you don't know mu and sigma, estimating the parameters from the data makes the p-value invalid:

import numpy as np
import scipy.stats as stats

mu = 0.3
sigma = 5

num_tests = 10**5
num_rejects = 0
alpha = 0.05
for i in range(num_tests):
    data = np.random.normal(mu, sigma, 10000)
    # normed_data = (data - mu) / sigma    # this is okay
    # 4915/100000 = 0.05 rejects at rejection level 0.05 (as expected)
    normed_data = (data - data.mean()) / data.std()    # this is NOT okay
    # 20/100000 = 0.00 rejects at rejection level 0.05 (not expected)
    D, pval = stats.kstest(normed_data, 'norm')
    if pval < alpha:
        num_rejects += 1
ratio = float(num_rejects) / num_tests
print('{}/{} = {:.2f} rejects at rejection level {}'.format(
    num_rejects, num_tests, ratio, alpha))

prints

20/100000 = 0.00 rejects at rejection level 0.05 (not expected) 

which shows that stats.kstest may not reject the expected number of null hypotheses if the sample is normalized using the sample's own mean and standard deviation:

normed_data = (data - data.mean()) / data.std()    # this is NOT okay 
answered Sep 27 '22 by unutbu


An update on unutbu's answer:

For distributions that depend only on location and scale and have no shape parameter, the distributions of several goodness-of-fit test statistics are independent of the location and scale values. The distribution is non-standard; however, it can be tabulated and used with any location and scale of the underlying distribution.
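For illustration, here is a rough Monte Carlo sketch of that tabulation idea (the helper name and the number of replications are arbitrary choices):

import numpy as np
import scipy.stats as stats

def ks_mc_pvalue(x, num_sims=1000, seed=12345):
    # KS statistic for x, with location and scale estimated from x itself
    d_obs, _ = stats.kstest((x - x.mean()) / x.std(), 'norm')
    rng = np.random.default_rng(seed)
    n = len(x)
    d_null = np.empty(num_sims)
    for i in range(num_sims):
        sim = rng.standard_normal(n)
        # estimate location and scale from each simulated sample as well,
        # so the null distribution reflects the estimation step
        d_null[i], _ = stats.kstest((sim - sim.mean()) / sim.std(), 'norm')
    # Monte Carlo p-value: fraction of null statistics >= the observed one
    return (d_null >= d_obs).mean()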

The Kolmogorov-Smirnov test for the normal distribution with estimated location and scale is also called the Lilliefors test.

It is now available in statsmodels, with approximate p-values for the relevant decision range.

>>> import numpy as np
>>> mu, sigma = 0.07, 0.89
>>> x = np.random.normal(mu, sigma, 10000)
>>> import statsmodels.api as sm
>>> sm.stats.lilliefors(x)
(0.0055267411213540951, 0.66190841161592895)

Most Monte Carlo studies show that the Anderson-Darling test is more powerful than the Kolmogorov-Smirnov test. It is available in scipy.stats with critical values, and in statsmodels with approximate p-values:

>>> sm.stats.normal_ad(x)
(0.23016468240712129, 0.80657628536145665)
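The scipy variant returns the statistic together with critical values instead of a p-value. Roughly, assuming a recent scipy (older versions return a plain tuple rather than a result object):

from scipy import stats

# x as generated above; anderson returns the statistic plus critical values
res = stats.anderson(x, dist='norm')
print(res.statistic)           # the A^2 statistic
print(res.critical_values)     # critical values...
print(res.significance_level)  # ...at matching significance levels (in %)
# reject normality at, say, the 5% level if res.statistic exceeds the
# critical value paired with significance_level == 5.0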

Neither of these tests rejects the null hypothesis that the sample is normally distributed, while the kstest in the question rejects the null hypothesis that the sample is standard normal distributed.

answered Sep 27 '22 by Josef