I would like to compare pairs of samples with both the Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) tests. I implemented this with scipy.stats.ks_2samp and scipy.stats.anderson_ksamp respectively. I would expect a low statistic for similar samples (0 for identical samples) and a higher statistic for more different samples.
For identical samples and for very different samples (no overlap), ks_2samp behaves as expected, while anderson_ksamp returns a negative statistic for identical samples and, more importantly, raises an error for very different samples (possibly related to the sample size: 200 in the example below).
Here is the code illustrating these findings:
import scipy.stats as stats
import numpy as np

# Two samples of 200 points each, one centered at 0 and one at 100, so they do not overlap
normal1 = np.random.normal(loc=0.0, scale=1.0, size=200)
normal2 = np.random.normal(loc=100, scale=1.0, size=200)
Using KS and AD on identical samples:
stats.ks_2samp(normal1, normal1)
stats.anderson_ksamp([normal1, normal1])
Returns respectively:
# Expected
Ks_2sampResult(statistic=0.0, pvalue=1.0)
# Not expected
Anderson_ksampResult(statistic=-1.3196852620954158, critical_values=array([ 0.325, 1.226, 1.961, 2.718, 3.752]), significance_level=1.4357209285296726)
And on the different samples:
stats.ks_2samp(normal1, normal2)
stats.anderson_ksamp([normal1, normal2])
Returns respectively:
# Expected
Ks_2sampResult(statistic=1.0, pvalue=1.4175052453413253e-89)
# Not expected
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-757-e3914aaf909c> in <module>()
----> 1 stats.anderson_ksamp([normal1, normal2])
/usr/lib/python3.5/site-packages/scipy/stats/morestats.py in anderson_ksamp(samples, midrank)
1694 warnings.warn("approximate p-value will be computed by extrapolation")
1695
-> 1696 p = math.exp(np.polyval(pf, A2))
1697 return Anderson_ksampResult(A2, critical, p)
1698
OverflowError: math range error
I think these two things actually make some sense. The significance level or p-value in the Anderson-Darling test is extrapolated based on where the test statistic falls within the range of critical values. The further to the right that the test statistic falls, the more significantly you can reject the null hypothesis that they are from the same distribution.
Note that for, say, 80-90 samples with your example distribution parameters, the test statistic for normal1 vs. normal2 is already hugely larger than the largest critical value, which means the extrapolated significance is free to grow (hugely, as the exponential of a convex-up quadratic function from polyfit) towards infinity. So yes, for a large sample size you end up computing the exponential of some huge number and getting an overflow. In other words, your data is so obviously not from the same distribution that the significance extrapolation overflows. In such a case, you might bootstrap a smaller data set from your actual data just to avoid the overflow (or bootstrap several times and average the statistic), as in the sketch below.
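Here is a minimal sketch of that workaround. The function name subsampled_anderson and the particular subsample size and number of rounds are my own choices, not anything prescribed by SciPy; the idea is just to keep each call small enough that the p-value extrapolation cannot overflow.
import numpy as np
import scipy.stats as stats

def subsampled_anderson(x, y, size=50, n_rounds=20):
    # Average the AD statistic over several small bootstrap subsamples.
    results = []
    for _ in range(n_rounds):
        xs = np.random.choice(x, size=size, replace=True)
        ys = np.random.choice(y, size=size, replace=True)
        try:
            results.append(stats.anderson_ksamp([xs, ys]).statistic)
        except OverflowError:
            # Even the subsample is separated enough to overflow the
            # p-value extrapolation; skip this round rather than crash.
            continue
    return np.mean(results) if results else float('inf')

subsampled_anderson(normal1, normal2)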
On the other end of the spectrum, when the sorted data sets are identical, it looks like some steps of the formula admit negative values. Essentially this means the statistic falls far to the left of the critical values, indicating a perfect match.
Once again, the significance is calculated by extrapolation, but this time it extrapolates from the smallest critical value down towards the test statistic, rather than from the largest critical value up towards the statistic as in the mismatching case. Since the statistic on the left (I'm seeing values of around -1.3 when using the same sample twice) is not that far below the smallest critical value (around 0.3), the extrapolated significance comes out "merely" as huge as about 140%, instead of exploding into exponentially large numbers. Still, a significance value of 1.4 is a signal that the data falls outside the range where the extrapolation is meaningful.
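To see where both numbers come from, here is a rough re-creation of the extrapolation step shown in the traceback (p = math.exp(np.polyval(pf, A2))): fit a quadratic to the log of the tabulated significance levels against the critical values, then evaluate it at the observed statistic. The critical values and the 25%/10%/5%/2.5%/1% levels are taken from the anderson_ksamp output above; SciPy's internal coefficients may differ slightly, and the large test statistic below is only an illustrative value.
import math
import numpy as np

critical = np.array([0.325, 1.226, 1.961, 2.718, 3.752])  # from the output above
levels = np.array([0.25, 0.10, 0.05, 0.025, 0.01])        # tabulated significance levels

pf = np.polyfit(critical, np.log(levels), 2)  # convex-up quadratic in the statistic

def extrapolated_p(A2):
    return math.exp(np.polyval(pf, A2))

print(extrapolated_p(-1.32))  # identical samples: roughly 1.4, a "significance" above 100%
try:
    extrapolated_p(500.0)     # a hugely mismatched pair of samples
except OverflowError:
    print("math range error")  # the same failure mode as anderson_ksamp above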
Most likely the negative statistic itself comes from the line in the source (linked above) where k - 1 "degrees of freedom" are subtracted from the calculated test statistic. In the two-sample case this means subtracting 1. If we add 1 back to the test statistic you're seeing, it puts you in the range of 0.31, which is almost exactly equal to the lowest critical value (exactly what you would expect for perfectly identical data: you cannot reject the null hypothesis at even the weakest significance level). So it's probably the degrees-of-freedom adjustment that pushes the statistic into the negative end of the spectrum, and that then gets magnified by the hacky quadratic-based p-value extrapolation.