Python, Pandas & Chi-Squared Test of Independence

I am quite new to both Python and statistics. I'm trying to apply the chi-squared test to determine whether previous success affects whether a person changes strategy (percentage-wise this does seem to be the case, but I want to check whether my results are statistically significant).

My question is: did I do this correctly? My results give a p-value of 0.0, which would mean there is a significant relationship between my variables (which is what I want, of course), but 0 seems a little too perfect for a p-value, so I'm wondering whether my code is wrong.

Here's what I did:

import numpy as np
import pandas as pd
import scipy.stats as stats

d = {'Previously Successful' : pd.Series([129.3, 182.7, 312], index=['Yes - changed strategy', 'No', 'col_totals']),
 'Previously Unsuccessful' : pd.Series([260.17, 711.83, 972], index=['Yes - changed strategy', 'No', 'col_totals']),
 'row_totals' : pd.Series([(129.3+260.17), (182.7+711.83), (312+972)], index=['Yes - changed strategy', 'No', 'col_totals'])}

total_summarized = pd.DataFrame(d)

observed = total_summarized.iloc[0:2, 0:2]  # .ix was removed in pandas 1.0; .iloc selects by position

Output: Observed

expected = np.outer(total_summarized["row_totals"].iloc[0:2],
                    total_summarized.loc["col_totals"].iloc[0:2])/1000

expected = pd.DataFrame(expected)

expected.columns = ["Previously Successful","Previously Unsuccessful"]
expected.index = ["Yes - changed strategy","No"]

chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()

print(chi_squared_stat)

crit = stats.chi2.ppf(q = 0.95,  # Find the critical value for 95% confidence
                      df = 8)

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,  # Find the p-value
                         df=8)
print("P value")
print(p_value)

stats.chi2_contingency(observed= observed)

Output Statistics

asked May 14 '17 11:05

Mia

1 Answer

A few corrections:

Your expected array is not correct. You must divide by observed.sum().sum(), which is 1284, not 1000.
For a 2x2 contingency table such as this, the number of degrees of freedom is (2 - 1) * (2 - 1) = 1, not 8.
Your calculation of chi_squared_stat does not include a continuity correction. (But it isn't necessarily wrong to omit it; that's a judgment call for the statistician.)
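
The first two corrections can be sketched by hand roughly as follows. This is a minimal sketch rather than the asker's original code: the variable names are illustrative, the table is typed in directly as a NumPy array, and `stats.chi2.sf` is used for the upper tail in place of `1 - cdf`.

```python
import numpy as np
import scipy.stats as stats

# Observed 2x2 table from the question
# (rows: changed strategy yes/no; columns: previously successful/unsuccessful).
observed = np.array([[129.3, 260.17],
                     [182.7, 711.83]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()  # 1284 for this table, not 1000

# Expected counts under independence.
expected = np.outer(row_totals, col_totals) / grand_total

# Pearson statistic without a continuity correction.
chi_squared_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1), which is 1 for a 2x2 table.
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-squared distribution.
p_value = stats.chi2.sf(chi_squared_stat, dof)

print(chi_squared_stat, dof, p_value)
```

The statistic and p-value computed this way should agree with `chi2_contingency(observed, correction=False)`.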

All the calculations that you perform (expected matrix, statistics, degrees of freedom, p-value) are computed by chi2_contingency:

In [65]: observed
Out[65]: 
                        Previously Successful  Previously Unsuccessful
Yes - changed strategy                  129.3                   260.17
No                                      182.7                   711.83

In [66]: from scipy.stats import chi2_contingency

In [67]: chi2, p, dof, expected = chi2_contingency(observed)

In [68]: chi2
Out[68]: 23.383138325890453

In [69]: p
Out[69]: 1.3273696199438626e-06

In [70]: dof
Out[70]: 1

In [71]: expected
Out[71]: 
array([[  94.63757009,  294.83242991],
       [ 217.36242991,  677.16757009]])
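
On the p-value of exactly 0.0 in the question: one possible contributor, in addition to the corrections listed at the top, is floating-point cancellation in `1 - stats.chi2.cdf(...)` when the statistic is large. The sketch below uses a made-up statistic of 100 with 8 degrees of freedom (illustrative values, not taken from the question's output) to compare that expression with the survival function `stats.chi2.sf`.

```python
import scipy.stats as stats

# Hypothetical values chosen only to illustrate the effect.
stat, df = 100.0, 8

# The cdf is so close to 1 here that the subtraction loses the tiny tail.
p_from_cdf = 1 - stats.chi2.cdf(stat, df)

# The survival function computes the upper tail directly.
p_from_sf = stats.chi2.sf(stat, df)

print(p_from_cdf)
print(p_from_sf)
```

With inputs like these, the first value can collapse to 0.0 while the second stays a very small positive number, so a printed 0.0 does not by itself mean the p-value is exactly zero.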

By default, chi2_contingency uses a continuity correction when the contingency table is 2x2. If you prefer to not use the correction, you can disable it with the argument correction=False:

In [73]: chi2, p, dof, expected = chi2_contingency(observed, correction=False)

In [74]: chi2
Out[74]: 24.072616672232893

In [75]: p
Out[75]: 9.2770200776879643e-07
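
For reference, the continuity correction for a 2x2 table can be sketched by hand: each absolute difference between observed and expected is reduced by 0.5 before squaring. This is an illustrative reimplementation of the Yates correction, not SciPy's internal code, and it assumes every absolute difference is at least 0.5, which holds for this table.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[129.3, 260.17],
                     [182.7, 711.83]])

# Expected counts under independence.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Yates correction: shrink each |observed - expected| by 0.5 before squaring.
corrected_terms = (np.abs(observed - expected) - 0.5) ** 2 / expected
chi2_yates = corrected_terms.sum()
p_yates = chi2.sf(chi2_yates, 1)  # 1 degree of freedom for a 2x2 table

# Compare with SciPy's default behaviour for 2x2 tables (correction=True).
chi2_scipy, p_scipy, dof, _ = chi2_contingency(observed)

print(chi2_yates, chi2_scipy)
print(p_yates, p_scipy)
```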

answered Nov 08 '22 20:11

Warren Weckesser

Tags: python, pandas, numpy, statistics, scipy
