
Run a chi-squared test with observed and expected counts and get a confidence interval

I am new to chi-squared testing and trying to figure out the 'standard' way of running a chi-squared test and also getting a 95% confidence interval on the difference between success rates in two experiments.

My data looks like this:

                Success      Failure      Total
Condition A:    25           75           100
Condition B:    100          100          200
Total:          125          175          300

These numbers represent the counts of what was observed during an experiment. As you can see, the number of samples for Condition A vs. Condition B were not the same.

What I'd like to get is:

  1. A test statistic indicating whether Condition A's success rate of 25% is statistically different from Condition B's success rate of 50%.
  2. A 95% confidence interval on the difference between the two success rates.

It seems like scipy.stats.chisquare expects the user to adjust the 'expected' counts so that they appear to be taken out of the same sample size as the 'observed' counts. Is that the only transformation I need to do? If not, what else do I need to do? Finally, how would I go about calculating the 95% confidence interval for the difference in proportions?

asked Dec 04 '22 by helloB

1 Answer

You have a contingency table. To perform the χ2 test on this data, you can use scipy.stats.chi2_contingency:

In [30]: import numpy as np

In [31]: from scipy.stats import chi2_contingency

In [32]: obs = np.array([[25, 75], [100, 100]])

In [33]: obs
Out[33]: 
array([[ 25,  75],
       [100, 100]])

In [34]: chi2, p, dof, expected = chi2_contingency(obs)

In [35]: p
Out[35]: 5.9148695289823149e-05
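Note that for 2x2 tables, chi2_contingency applies Yates' continuity correction by default; that p-value corresponds to the corrected test. Passing correction=False gives the uncorrected statistic, which matches the prop.test(obs, correct=FALSE) R output shown later in this answer:

```python
# Uncorrected chi-squared test on the same 2x2 table; with
# correction=False the statistic matches R's prop.test(correct=FALSE).
import numpy as np
from scipy.stats import chi2_contingency

obs = np.array([[25, 75], [100, 100]])
chi2, p, dof, expected = chi2_contingency(obs, correction=False)
print(chi2, p)  # chi2 ~ 17.1429, p ~ 3.467e-05
```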

Because your contingency table is 2x2, you can also use Fisher's exact test, implemented in scipy as scipy.stats.fisher_exact:

In [148]: from scipy.stats import fisher_exact

In [149]: oddsr, pval = fisher_exact(obs)

In [150]: pval
Out[150]: 3.7175015403965242e-05

scipy doesn't have much more for contingency tables. It looks like the next release of statsmodels will have more tools for the analysis of contingency tables, but that doesn't help right now.
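(For what it's worth, later statsmodels releases did eventually grow a helper for exactly this interval. A sketch, assuming you have a statsmodels version, 0.12 or newer, that provides confint_proportions_2indep; the Wald variant should agree with the hand-rolled interval below.)

```python
# Sketch assuming statsmodels >= 0.12, which added
# confint_proportions_2indep.  Arguments are
# (count1, nobs1, count2, nobs2): Condition A is 25/100,
# Condition B is 100/200.
from statsmodels.stats.proportion import confint_proportions_2indep

low, upp = confint_proportions_2indep(25, 100, 100, 200,
                                      compare='diff', method='wald')
print(low, upp)  # roughly (-0.360, -0.140)
```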

It isn't hard to write some code to compute the difference in proportion and its 95% confidence interval. Here's one way:

# Include this if you are using Python 2.7.  Or tweak the code in the
# function to ensure that division uses floating point.
from __future__ import division

import numpy as np


def diffprop(obs):
    """
    `obs` must be a 2x2 numpy array.

    Returns:
    delta
        The difference in proportions
    ci
        The Wald 95% confidence interval for delta
    corrected_ci
        Yates continuity correction for the 95% confidence interval of delta.
    """
    n1, n2 = obs.sum(axis=1)
    prop1 = obs[0,0] / n1
    prop2 = obs[1,0] / n2
    delta = prop1 - prop2

    # Wald 95% confidence interval for delta
    se = np.sqrt(prop1*(1 - prop1)/n1 + prop2*(1 - prop2)/n2)
    ci = (delta - 1.96*se, delta + 1.96*se)

    # Yates continuity correction for confidence interval of delta
    correction = 0.5*(1/n1 + 1/n2)
    corrected_ci = (ci[0] - correction, ci[1] + correction)

    return delta, ci, corrected_ci

For example,

In [22]: obs
Out[22]: 
array([[ 25,  75],
       [100, 100]])

In [23]: diffprop(obs)
Out[23]: 
(-0.25,
 (-0.35956733089748971, -0.14043266910251032),
 (-0.36706733089748972, -0.13293266910251031))

The first value returned is the difference in proportions delta. The next two pairs are the Wald 95% confidence interval for delta, and the Wald 95% confidence interval with Yates continuity correction.
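As a sanity check, you can reproduce the Wald endpoints by hand from the two success rates:

```python
# Wald 95% CI for the difference in proportions, computed directly
# from the counts in the table (A: 25/100, B: 100/200).
import numpy as np

p1, n1 = 25 / 100, 100    # Condition A success rate
p2, n2 = 100 / 200, 200   # Condition B success rate
delta = p1 - p2           # -0.25
se = np.sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)
lo, hi = delta - 1.96*se, delta + 1.96*se
print(lo, hi)  # ~ (-0.359567, -0.140433)
```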

If you don't like those negative values, you can reverse the rows first:

In [24]: diffprop(obs[::-1])
Out[24]: 
(0.25,
 (0.14043266910251032, 0.35956733089748971),
 (0.13293266910251031, 0.36706733089748972))

For comparison, here is a similar calculation in R:

> obs
     [,1] [,2]
[1,]   25   75
[2,]  100  100
> prop.test(obs, correct=FALSE)

    2-sample test for equality of proportions without continuity
    correction

data:  obs
X-squared = 17.1429, df = 1, p-value = 3.467e-05
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.3595653 -0.1404347
sample estimates:
prop 1 prop 2 
  0.25   0.50 

> prop.test(obs, correct=TRUE)

    2-sample test for equality of proportions with continuity correction

data:  obs
X-squared = 16.1297, df = 1, p-value = 5.915e-05
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.3670653 -0.1329347
sample estimates:
prop 1 prop 2 
  0.25   0.50

answered Dec 08 '22 by Warren Weckesser