
Chi squared test in Python

I'd like to run a chi-squared test in Python. I've created code to do this, but I don't know if what I'm doing is right, because the scipy docs are quite sparse.

Background first: I have two groups of users. My null hypothesis is that device type (desktop, mobile, or tablet) is independent of group, i.e. users in both groups are equally likely to use each device type.

These are the observed frequencies in the two groups:

[[u'desktop', 14452], [u'mobile', 4073], [u'tablet', 4287]]
[[u'desktop', 30864], [u'mobile', 11439], [u'tablet', 9887]]

Here is my code using scipy.stats.chi2_contingency:

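import numpy as np
from scipy import stats

# Rows are the two groups; columns are desktop, mobile, tablet
obs = np.array([[14452, 4073, 4287], [30864, 11439, 9887]])
chi2, p, dof, expected = stats.chi2_contingency(obs)
print(p)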

This gives me a p-value of 2.02258737401e-38, which clearly is significant.

My question is: does this code look valid? In particular, I'm not sure whether I should be using scipy.stats.chi2_contingency or scipy.stats.chisquare, given the data I have.

Richard asked Aug 05 '14 12:08



2 Answers

I can't comment much on the use of the function. However, the issue at hand may be statistical in nature. The very small p-value you are seeing is most likely a result of your data containing large frequencies (on the order of ten thousand). When the sample size is very large, even tiny differences become statistically significant, hence the small p-value. The tests you are using are very sensitive to sample size, so a significant result here does not by itself mean the difference is large in practical terms. See here for more details.
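One way to see the point about sample size is to report an effect size alongside the p-value. The sketch below is not part of the original answer: it computes Cramér's V by hand from the chi-squared statistic, using the counts from the question. The choice of Cramér's V as the effect-size measure is an assumption on my part; other measures would also work.

```python
import numpy as np
from scipy import stats

# Observed counts from the question: rows are groups,
# columns are desktop, mobile, tablet.
obs = np.array([[14452, 4073, 4287], [30864, 11439, 9887]])
chi2, p, dof, expected = stats.chi2_contingency(obs)

# Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1))).
# It ranges from 0 (no association) to 1 (perfect association).
n = obs.sum()
k = min(obs.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))

print("chi2 = %.2f, dof = %d, p = %.3g" % (chi2, dof, p))
print("Cramer's V = %.4f" % cramers_v)
```

On this data V comes out well under 0.1, which is usually read as a weak association, even though the p-value is tiny.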

Luca Terzio Pontiggia answered Sep 30 '22 15:09


You are using chi2_contingency correctly. If you feel uncertain about the appropriate use of a chi-squared test or how to interpret its result (i.e. your question is about statistical testing rather than coding), consider asking it over at the "CrossValidated" site: https://stats.stackexchange.com/
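As a rough illustration of how the two functions relate (a sketch added for clarity, not from the original answer): chi2_contingency derives the expected counts from the row and column totals of a contingency table, while chisquare is a goodness-of-fit test that compares one set of observed counts against expected counts you supply. Feeding chisquare the expected table returned by chi2_contingency, with ddof adjusted so the degrees of freedom match, should reproduce the same statistic and p-value.

```python
import numpy as np
from scipy import stats

obs = np.array([[14452, 4073, 4287], [30864, 11439, 9887]])

# Test of independence: expected counts come from the table's margins.
chi2, p, dof, expected = stats.chi2_contingency(obs)

# Goodness-of-fit on the flattened table. chisquare uses
# k - 1 - ddof degrees of freedom (k = number of cells), so pick
# ddof such that the result matches dof from chi2_contingency.
ddof = obs.size - 1 - dof
stat, p_gof = stats.chisquare(obs.ravel(), f_exp=expected.ravel(), ddof=ddof)

print("chi2_contingency: chi2 = %.4f, p = %.3g" % (chi2, p))
print("chisquare:        chi2 = %.4f, p = %.3g" % (stat, p_gof))
```

For two groups compared across categories, as in the question, chi2_contingency is the more direct choice because it handles the expected counts and degrees of freedom for you.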

Warren Weckesser answered Sep 30 '22 17:09