Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to perform a chi-squared goodness of fit test using scientific libraries in Python?

Let's assume I have some data I obtained empirically:

from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)

It is exponentially distributed (with some noise) and I want to verify this using a chi-squared goodness of fit (GoF) test. What is the simplest way of doing this using the standard scientific libraries in Python (e.g. scipy or statsmodels) with the least amount of manual steps and assumptions?

I can fit a model with:

param = stats.expon.fit(x)
plt.hist(x, normed=True, color='white', hatch='/')
plt.plot(grid, distr.pdf(np.linspace(0, 100, 10000), *param))

distribution and empirical data plot

It is very elegant to calculate the Kolmogorov-Smirnov test.

>>> stats.kstest(x, lambda x : stats.expon.cdf(x, *param))
(0.0061000000000000004, 0.85077099515985011)

However, I can't find a good way of calculating the chi-squared test.

There is a chi-squared GoF function in statsmodel, but it assumes a discrete distribution (and the exponential distribution is continuous).

The official scipy.stats tutorial only covers a case for a custom distribution and probabilities are built by fiddling with many expressions (npoints, npointsh, nbound, normbound), so it's not quite clear to me how to do it for other distributions. The chisquare examples assume the expected values and DoF are already obtained.

Also, I am not looking for a way to "manually" perform the test as was already discussed here, but would like to know how to apply one of the available library functions.

like image 713
metakermit Avatar asked Jun 23 '14 16:06

metakermit


People also ask

How do you find Chi-square critical value in Python?

The Chi-Square critical value can be found by using a Chi-Square distribution table or by using statistical software. To find the Chi-Square critical value, you need: A significance level (common choices are 0.01, 0.05, and 0.10) Degrees of freedom.

How do you calculate goodness of fit in Python?

chisquare() function. In this approach we use stats. chisquare() method from the scipy. stats module which helps us determine chi-square goodness of fit statistic and p-value.


2 Answers

An approximate solution for equal probability bins:

  • Estimate the parameters of the distribution
  • Use the inverse cdf, ppf if it's a scipy.stats.distribution, to get the binedges for a regular probability grid, e.g. distribution.ppf(np.linspace(0, 1, n_bins + 1), *args)
  • Then, use np.histogram to count the number of observations in each bin

then use chisquare test on the frequencies.

An alternative would be to find the bin edges from the percentiles of the sorted data, and use the cdf to find the actual probabilities.

This is only approximate, since the theory for the chisquare test assumes that the parameters are estimated by maximum likelihood on the binned data. And I'm not sure whether the selection of binedges based on the data affects the asymptotic distribution.

I haven't looked into this into a long time. If an approximate solution is not good enough, then I would recommend that you ask the question on stats.stackexchange.

like image 99
Josef Avatar answered Oct 20 '22 18:10

Josef


Why do you need to "verify" that it's exponential? Are you sure you need a statistical test? I can pretty much guarantee that is isn't ultimately exponential & the test would be significant if you had enough data, making the logic of using the test rather forced. It may help you to read this CV thread: Is normality testing 'essentially useless'?, or my answer here: Testing for heteroscedasticity with many observations.

It is typically better to use a qq-plot and/or pp-plot (depending on whether you are concerned about the fit in the tails or middle of the distribution, see my answer here: PP-plots vs. QQ-plots). Information on how to make qq-plots in Python SciPy can be found in this SO thread: Quantile-Quantile plot using SciPy

like image 26
gung - Reinstate Monica Avatar answered Oct 20 '22 20:10

gung - Reinstate Monica