I am new to chi-squared testing and am trying to figure out the 'standard' way of running a chi-squared test while also getting a 95% confidence interval on the difference between the success rates in two experiments.
My data looks like this:
Condition A:  25  75  (100 total)
Condition B: 100 100  (200 total)
Total:       125 175  (300 total)
These numbers are the counts observed during an experiment. As you can see, the sample sizes for Condition A and Condition B were not the same.
What I'd like to get is the result of the chi-squared test plus a 95% confidence interval on the difference between the two success rates.
It seems like scipy.stats.chisquare expects the user to adjust the 'expected' counts so that they appear to be taken out of the same sample size as the 'observed' counts. Is that the only transformation I need to do? If not, what else do I need to do? Finally, how would I go about calculating the 95% confidence interval for the difference in proportions?
You have a contingency table. To perform the χ² test on this data, you can use scipy.stats.chi2_contingency:
In [30]: import numpy as np
In [31]: from scipy.stats import chi2_contingency
In [32]: obs = np.array([[25, 75], [100, 100]])
In [33]: obs
Out[33]:
array([[ 25,  75],
       [100, 100]])
In [34]: chi2, p, dof, expected = chi2_contingency(obs)
In [35]: p
Out[35]: 5.9148695289823149e-05
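Note that the `expected` array returned by chi2_contingency already respects each condition's own sample size, so no manual rescaling is needed. As a quick sanity check (my own sketch, using the standard independence formula for expected counts):

```python
import numpy as np

obs = np.array([[25, 75], [100, 100]])

# Expected counts under the null hypothesis of independence:
# (row total * column total) / grand total for each cell.
row_totals = obs.sum(axis=1, keepdims=True)   # [[100], [200]]
col_totals = obs.sum(axis=0, keepdims=True)   # [[125, 175]]
expected = row_totals * col_totals / obs.sum()

print(expected)
# The row sums of `expected` equal the observed row totals (100 and 200),
# so each condition keeps its own sample size.
```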
Your contingency table is 2x2, so you can use Fisher's exact test. This is implemented in scipy as scipy.stats.fisher_exact:
In [148]: from scipy.stats import fisher_exact
In [149]: oddsr, pval = fisher_exact(obs)
In [150]: pval
Out[150]: 3.7175015403965242e-05
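As a cross-check (my own sketch, not part of scipy's documentation): for a 2x2 table, the chi-squared statistic without continuity correction is exactly the square of the pooled two-proportion z statistic, so the χ² test is equivalent to the familiar two-sample test for equality of proportions:

```python
import numpy as np
from scipy.stats import chi2_contingency

obs = np.array([[25, 75], [100, 100]])
n1, n2 = obs.sum(axis=1)
p1, p2 = obs[:, 0] / obs.sum(axis=1)

# Pooled two-proportion z statistic
p_pool = obs[:, 0].sum() / obs.sum()
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
z = (p1 - p2) / se_pool

chi2, p, dof, expected = chi2_contingency(obs, correction=False)
print(z**2, chi2)  # z**2 equals the uncorrected chi-squared statistic
```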
scipy doesn't have much more for contingency tables. It looks like the next release of statsmodels will have more tools for the analysis of contingency tables, but that doesn't help right now.
It isn't hard to write some code to compute the difference in proportion and its 95% confidence interval. Here's one way:
# Include this if you are using Python 2.7. Or tweak the code in the
# function to ensure that division uses floating point.
from __future__ import division

import numpy as np


def diffprop(obs):
    """
    `obs` must be a 2x2 numpy array.

    Returns:
    delta
        The difference in proportions
    ci
        The Wald 95% confidence interval for delta
    corrected_ci
        Yates continuity correction for the 95% confidence interval of delta.
    """
    n1, n2 = obs.sum(axis=1)
    prop1 = obs[0, 0] / n1
    prop2 = obs[1, 0] / n2
    delta = prop1 - prop2

    # Wald 95% confidence interval for delta
    se = np.sqrt(prop1*(1 - prop1)/n1 + prop2*(1 - prop2)/n2)
    ci = (delta - 1.96*se, delta + 1.96*se)

    # Yates continuity correction for confidence interval of delta
    correction = 0.5*(1/n1 + 1/n2)
    corrected_ci = (ci[0] - correction, ci[1] + correction)

    return delta, ci, corrected_ci
For example,
In [22]: obs
Out[22]:
array([[ 25,  75],
       [100, 100]])
In [23]: diffprop(obs)
Out[23]:
(-0.25,
(-0.35956733089748971, -0.14043266910251032),
(-0.36706733089748972, -0.13293266910251031))
The first value returned is the difference in proportions delta. The next two pairs are the Wald 95% confidence interval for delta, and the Wald 95% confidence interval with Yates continuity correction.
If you don't like those negative values, you can reverse the rows first:
In [24]: diffprop(obs[::-1])
Out[24]:
(0.25,
(0.14043266910251032, 0.35956733089748971),
(0.13293266910251031, 0.36706733089748972))
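If you want a confidence level other than 95%, the hard-coded 1.96 can be replaced with the appropriate normal quantile from scipy.stats.norm.ppf. A small generalization of the function above (the name diffprop_ci is my own, and it returns only the uncorrected Wald interval):

```python
import numpy as np
from scipy.stats import norm

def diffprop_ci(obs, alpha=0.05):
    """Wald confidence interval for the difference in proportions
    of a 2x2 table `obs`, at confidence level 1 - alpha."""
    n1, n2 = obs.sum(axis=1)
    prop1 = obs[0, 0] / n1
    prop2 = obs[1, 0] / n2
    delta = prop1 - prop2
    z = norm.ppf(1 - alpha/2)   # about 1.959964 for alpha=0.05
    se = np.sqrt(prop1*(1 - prop1)/n1 + prop2*(1 - prop2)/n2)
    return delta, (delta - z*se, delta + z*se)

obs = np.array([[25, 75], [100, 100]])
print(diffprop_ci(obs))             # 95% interval, close to the 1.96 version
print(diffprop_ci(obs, alpha=0.1))  # 90% interval, slightly narrower
```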
For comparison, here is a similar calculation in R:
> obs
[,1] [,2]
[1,] 25 75
[2,] 100 100
> prop.test(obs, correct=FALSE)
2-sample test for equality of proportions without continuity
correction
data: obs
X-squared = 17.1429, df = 1, p-value = 3.467e-05
alternative hypothesis: two.sided
95 percent confidence interval:
-0.3595653 -0.1404347
sample estimates:
prop 1 prop 2
0.25 0.50
> prop.test(obs, correct=TRUE)
2-sample test for equality of proportions with continuity correction
data: obs
X-squared = 16.1297, df = 1, p-value = 5.915e-05
alternative hypothesis: two.sided
95 percent confidence interval:
-0.3670653 -0.1329347
sample estimates:
prop 1 prop 2
0.25 0.50