p_value is 0 when I use scipy.stats.kstest() for large dataset

I have a series of unique values with their frequencies and want to know whether they come from a normal distribution, so I ran a Kolmogorov–Smirnov test using scipy.stats.kstest. As far as I know, the function only takes a list, so I expanded the frequencies into a list before passing it to the function. However, the result is strange: pvalue=0.0.

The histogram of the original data and my code are below:

[image: Histogram of my dataset]

[In]: frequencies = mp[['c','v']]

[In]: print frequencies
         c      v
31  3475.8   18.0
30  3475.6   12.0
29  3475.4   13.0
28  3475.2    8.0
20  3475.0   49.0
14  3474.8   69.0
13  3474.6   79.0
12  3474.4   78.0
11  3474.2   78.0
7   3474.0  151.0
6   3473.8  157.0
5   3473.6  129.0
2   3473.4  149.0
1   3473.2  162.0
0   3473.0  179.0
3   3472.8  145.0
4   3472.6  139.0
8   3472.4   95.0
9   3472.2  103.0
10  3472.0  125.0
15  3471.8   56.0
16  3471.6   75.0
17  3471.4   70.0
18  3471.2   70.0
19  3471.0   57.0
21  3470.8   36.0
22  3470.6   22.0
23  3470.4   20.0
24  3470.2   12.0
25  3470.0   23.0
26  3469.8   13.0
27  3469.6   17.0
32  3469.4    6.0

[In]: testData = map(lambda x: np.repeat(x[0], int(x[1])), frequencies.values)

[In]: testData = list(itertools.chain.from_iterable(testData))

[In]: print len(testData)
2415

[In]: print np.unique(testData)
[ 3469.4  3469.6  3469.8  3470.   3470.2  3470.4  3470.6  3470.8  3471.
  3471.2  3471.4  3471.6  3471.8  3472.   3472.2  3472.4  3472.6  3472.8
  3473.   3473.2  3473.4  3473.6  3473.8  3474.   3474.2  3474.4  3474.6
  3474.8  3475.   3475.2  3475.4  3475.6  3475.8]

[In]: scs.kstest(testData, 'norm')
KstestResult(statistic=1.0, pvalue=0.0)

Thanks in advance, everyone.

asked Oct 11 '17 01:10

Gabriel_F

1 Answer

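Passing 'norm' as the second argument tests your data against scipy.stats.norm.cdf with its default parameters, loc=0 and scale=1. Your values all lie around 3470, where the standard normal cdf is numerically equal to 1, so the empirical and reference cdfs differ by the maximum possible amount. That is why you get statistic=1.0 and pvalue=0.0.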
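To make this concrete, here is a minimal sketch. The small `sample` array is a hypothetical stand-in for the question's data (three of its values with their counts), not the full dataset. It shows the standard normal cdf saturating at 1.0 for values this large, and that passing parameters through the `args` argument of kstest moves the reference distribution to the data.

```python
import numpy as np
from scipy.stats import norm, kstest

# Hypothetical stand-in: three values from the question, repeated by their counts
sample = np.repeat([3469.4, 3473.0, 3475.8], [6, 179, 18])

# The standard normal cdf (loc=0, scale=1) is numerically 1.0 this far out
cdf_values = norm.cdf(sample)
all_ones = bool(np.all(cdf_values == 1.0))

# Against the default 'norm', the KS distance is therefore the maximum, 1.0
default_result = kstest(sample, 'norm')

# Supplying loc and scale via args compares against a normal centred on the data
shifted_result = kstest(sample, 'norm', args=(sample.mean(), sample.std()))

print(all_ones, default_result.statistic, shifted_result.statistic)
```

The exact statistic for the shifted case depends on the stand-in sample, so only the default case should be read as mirroring the question's output.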

Instead, you will need to fit a normal distribution to your data and then check if the data and the distribution are the same using the Kolmogorov–Smirnov test.

import numpy as np
from scipy.stats import norm, kstest
import matplotlib.pyplot as plt

freqs = [[3475.8, 18.0], [3475.6, 12.0], [3475.4, 13.0], [3475.2, 8.0], [3475.0, 49.0],
    [3474.8, 69.0], [3474.6, 79.0], [3474.4, 78.0], [3474.2, 78.0], [3474.0, 151.0],
    [3473.8, 157.0], [3473.6, 129.0], [3473.4, 149.0], [3473.2, 162.0], [3473.0, 179.0],
    [3472.8, 145.0], [3472.6, 139.0], [3472.4, 95.0], [3472.2, 103.0], [3472.0, 125.0],
    [3471.8, 56.0], [3471.6, 75.0], [3471.4, 70.0], [3471.2, 70.0], [3471.0, 57.0],
    [3470.8, 36.0], [3470.6, 22.0], [3470.4, 20.0], [3470.2, 12.0], [3470.0, 23.0],
    [3469.8, 13.0], [3469.6, 17.0], [3469.4, 6.0]]

data = np.hstack([np.repeat(x,int(f)) for x,f in freqs])
loc, scale = norm.fit(data)
# create a normal distribution with loc and scale
n = norm(loc=loc, scale=scale)
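As a side note, norm.fit returns the maximum-likelihood estimates, which for a normal distribution are the sample mean and the population standard deviation (ddof=0). The following sketch, which assumes the same values and counts as the table in the question (written here in ascending order of value), checks that the same numbers can be computed directly from the frequencies without expanding them.

```python
import numpy as np
from scipy.stats import norm

# Values 3469.4 ... 3475.8 in steps of 0.2, with their counts from the question
values = np.round(3469.4 + 0.2 * np.arange(33), 1)
counts = np.array([6, 17, 13, 23, 12, 20, 22, 36, 57, 70, 70, 75, 56, 125,
                   103, 95, 139, 145, 179, 162, 149, 129, 157, 151, 78, 78,
                   79, 69, 49, 8, 13, 12, 18])

# Frequency-weighted mean and standard deviation
w_mean = np.average(values, weights=counts)
w_std = np.sqrt(np.average((values - w_mean) ** 2, weights=counts))

# The same estimates from the expanded data via norm.fit
data = np.repeat(values, counts)
loc, scale = norm.fit(data)

print(w_mean, loc)
print(w_std, scale)
```

kstest itself still needs the expanded sample, so this only replaces the fitting step.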

Plot the fit of the norm to the data:

plt.hist(data, bins=np.arange(data.min(), data.max()+0.2, 0.2), rwidth=0.5)
x = np.arange(data.min(), data.max()+0.2, 0.2)
plt.plot(x, 350*n.pdf(x))
plt.show()

[image: histogram of the data with the fitted normal pdf overlaid]

This is not a terribly good fit, mostly due to the long tail on the left. However, you can now run a proper Kolmogorov–Smirnov test using the cdf of the fitted normal distribution:

kstest(data, n.cdf)
# returns:
KstestResult(statistic=0.071276854859734784, pvalue=4.0967451653273201e-11)

So we still reject the null hypothesis that the data were drawn from the fitted normal distribution.
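One caveat worth checking: the p-value reported by kstest assumes the reference distribution was fixed in advance, whereas here loc and scale were estimated from the same sample, and the data are rounded to a 0.2 grid, which produces ties. The sketch below is one possible way to sanity-check the conclusion, not a definitive procedure: a parametric bootstrap that simulates normal samples, rounds them to the same grid, refits, and compares the resulting KS statistics with the observed one. The helper name `ks_stat_fitted`, the seed, and the number of simulations are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import norm, kstest

values = np.round(3469.4 + 0.2 * np.arange(33), 1)
counts = np.array([6, 17, 13, 23, 12, 20, 22, 36, 57, 70, 70, 75, 56, 125,
                   103, 95, 139, 145, 179, 162, 149, 129, 157, 151, 78, 78,
                   79, 69, 49, 8, 13, 12, 18])
data = np.repeat(values, counts)

def ks_stat_fitted(sample):
    """Fit a normal to the sample and return the KS statistic against that fit."""
    loc, scale = norm.fit(sample)
    return kstest(sample, norm(loc=loc, scale=scale).cdf).statistic

observed = ks_stat_fitted(data)

# Simulate from the fitted normal, rounding to the same 0.2 grid as the data
rng = np.random.default_rng(0)
loc, scale = norm.fit(data)
n_sim = 200
simulated = np.empty(n_sim)
for i in range(n_sim):
    draw = rng.normal(loc, scale, size=data.size)
    simulated[i] = ks_stat_fitted(np.round(draw / 0.2) * 0.2)

# Monte Carlo p-value: share of simulated statistics at least as large as observed
p_mc = (1 + np.sum(simulated >= observed)) / (n_sim + 1)
print(observed, p_mc)
```

If the bootstrap p-value is also small, the rejection above does not hinge on the estimated parameters or on the rounding.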

answered Oct 07 '22 16:10

James

Tags: python, testing, scipy, normal-distribution, kolmogorov-smirnov
