Using a two-sample Kolmogorov-Smirnov test, I am getting a p-value of 0.0.
>>> scipy.stats.ks_2samp(dataset1, dataset2)
(0.65296076312083573, 0.0)
Looking at the histograms of the two datasets, I am quite confident they come from two different distributions. But, really, p = 0.0? That doesn't seem to make sense. Shouldn't it be a very small but positive number?
I know the return value is of type numpy.float64. Does that have something to do with it?
EDIT: data here: https://www.dropbox.com/s/jpixhz0pcybyh1t/data4stack.csv
scipy.version.full_version
'0.13.2'
If you observe a sample that's impossible under the null hypothesis (and the statistic is able to detect that), you can get a p-value of exactly zero. More often, though, the true p-value is simply smaller than the smallest positive float64, so it underflows to 0.0.
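A minimal sketch of the floating-point side of this: the returned p-value is an IEEE-754 double (`numpy.float64`), and the smallest positive subnormal double is about 5e-324. Any probability below that floor cannot be represented and becomes exactly 0.0:

```python
import sys

# Smallest positive *normal* double (~2.2e-308).
print(sys.float_info.min)

tiny = 5e-324       # smallest positive subnormal double
print(tiny > 0.0)   # True: still representable
print(tiny / 2)     # 0.0: anything smaller underflows to exactly zero
```

So a returned 0.0 does not mean "impossible"; it usually means "smaller than ~5e-324".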
Yes, the probability is very small:
>>> from pprint import pprint
>>> pprint([(i, scipy.stats.ks_2samp(dataset1, dataset2[:i])[1])
...         for i in range(200, len(dataset2), 200)])
[(200, 3.1281733251275881e-63),
(400, 3.5780609056448825e-157),
(600, 9.2884803664366062e-225),
(800, 7.1429666685167604e-293),
(1000, 0.0),
(1200, 0.0),
(1400, 0.0),
(1600, 0.0),
(1800, 0.0),
(2000, 0.0),
(2200, 0.0),
(2400, 0.0)]
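For intuition about where the exact zero comes from, here is a sketch of a two-sample KS computation with the asymptotic Smirnov p-value (the Numerical Recipes-style approximation; `ks_2samp_p` and the generated samples are my own illustration, not scipy's actual implementation or the OP's data). For large samples from clearly different distributions, every `exp()` term in the series underflows, so the sum is exactly 0.0:

```python
import numpy as np

def ks_2samp_p(x, y):
    """Two-sample KS statistic D plus the asymptotic (Smirnov) p-value.

    Illustrative sketch only -- not scipy's implementation.
    """
    x, y = np.sort(x), np.sort(y)
    data = np.concatenate([x, y])
    # Empirical CDFs of both samples evaluated at every data point.
    cdf_x = np.searchsorted(x, data, side="right") / len(x)
    cdf_y = np.searchsorted(y, data, side="right") / len(y)
    d = np.max(np.abs(cdf_x - cdf_y))
    en = np.sqrt(len(x) * len(y) / (len(x) + len(y)))
    lam = (en + 0.12 + 0.11 / en) * d
    # Each exp() term is a float64; for large lam all of them underflow to 0.0.
    p = 2 * sum((-1) ** (k - 1) * np.exp(-2 * k**2 * lam**2)
                for k in range(1, 101))
    return d, float(min(max(p, 0.0), 1.0))

rng = np.random.default_rng(0)
a, b = rng.normal(0, 1, 500), rng.normal(2, 1, 500)  # clearly different
d, p = ks_2samp_p(a, b)
print(d, p)            # p is tiny but still nonzero at this sample size
d_big, p_big = ks_2samp_p(rng.normal(0, 1, 5000), rng.normal(2, 1, 5000))
print(d_big, p_big)    # p_big is exactly 0.0: every series term underflowed
```

If the magnitude matters to you, report the D statistic (the first return value) rather than the p-value; once the true p-value drops below ~5e-324, a float64 simply cannot hold it.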