
Why does t-test in Python (scipy, statsmodels) give results different from R, Stata, or Excel?

(Problem resolved: the Python arrays s1 and s2 were not the same size as x and y.)

in R:

x <- c(373,398,245,272,238,241,134,410,158,125,198,252,577,272,208,260)
y <- c(411,471,320,364,311,390,163,424,228,144,246,371,680,384,279,303)
t.test(x,y)
t = -1.6229, df = 29.727, p-value = 0.1152

The same numbers are obtained in Stata and Excel.

t.test(x,y,alternative="less")
t = -1.6229, df = 29.727, p-value = 0.05758

I cannot replicate the same result using either statsmodels.stats.weightstats.ttest_ind or scipy.stats.ttest_ind no matter which options I try.

statsmodels.stats.weightstats.ttest_ind(s1,s2,alternative="two-sided",usevar="unequal")
(-1.8912081781378358, 0.066740317997990656, 35.666557473974343)

scipy.stats.ttest_ind(s1,s2,equal_var=False)
(array(-1.8912081781378338), 0.066740317997990892)

scipy.stats.ttest_ind(s1,s2,equal_var=True)
(array(-1.8912081781378338), 0.066664507499812745)

There must be thousands of people who use Python to calculate t-tests. Are we all getting incorrect results? (I typically rely on Python, but this time I checked my results against Stata.)

Oleg asked Dec 20 '13 18:12




2 Answers

This is what I get with the default, equal variances:

>>> x_ = (373,398,245,272,238,241,134,410,158,125,198,252,577,272,208,260)
>>> y_ = (411,471,320,364,311,390,163,424,228,144,246,371,680,384,279,303)

>>> from scipy import stats
>>> stats.ttest_ind(x_, y_)
(array(-1.62292672368488), 0.11506840827144681)

>>> import statsmodels.api as sm
>>> sm.stats.ttest_ind(x_, y_)
(-1.6229267236848799, 0.11506840827144681, 30.0)

and with unequal var:

>>> sm.stats.ttest_ind(x_, y_, alternative="two-sided", usevar="unequal")
(-1.6229267236848799, 0.11516398707890187, 29.727196553288369)
>>> stats.ttest_ind(x_, y_, equal_var=False)
(array(-1.62292672368488), 0.11516398707890187)
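
As a cross-check on the unequal-variance numbers above, here is a minimal sketch that computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom by hand, using only the standard library. The helper name `welch_t_and_df` is made up for illustration and is not part of scipy or statsmodels.

```python
import math
import statistics

x = [373, 398, 245, 272, 238, 241, 134, 410, 158, 125, 198, 252, 577, 272, 208, 260]
y = [411, 471, 320, 364, 311, 390, 163, 424, 228, 144, 246, 371, 680, 384, 279, 303]


def welch_t_and_df(a, b):
    """Hypothetical helper: Welch's t statistic and Welch-Satterthwaite df."""
    # Sample variance (n - 1 denominator) divided by the sample size,
    # i.e. the squared standard error of each mean.
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


t_stat, df = welch_t_and_df(x, y)
# Should agree with the R output in the question (t = -1.6229, df = 29.727).
print(f"t = {t_stat:.4f}, df = {df:.3f}")
```

If the hand computation and the library call disagree, the inputs are the first thing to check, not the library.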
Josef answered Oct 21 '22 13:10


The short answer is that the t-tests provided in Python give the same results as R and Stata. You just had an additional element in your Python arrays.
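
One way to catch this kind of mismatch early is to check the sample sizes before running the test. The sketch below assumes the expected size is known (16 here, as in the question) and a scipy version that accepts the `alternative` keyword in `ttest_ind` (1.6 or later, as far as I know). The wrapper `checked_welch` is a hypothetical name, not a scipy function.

```python
from scipy import stats

x = [373, 398, 245, 272, 238, 241, 134, 410, 158, 125, 198, 252, 577, 272, 208, 260]
y = [411, 471, 320, 364, 311, 390, 163, 424, 228, 144, 246, 371, 680, 384, 279, 303]


def checked_welch(a, b, expected_n=None, alternative="two-sided"):
    """Hypothetical helper: verify sample sizes, then run Welch's t-test."""
    if expected_n is not None:
        for name, sample in (("a", a), ("b", b)):
            if len(sample) != expected_n:
                raise ValueError(
                    f"sample {name} has {len(sample)} values, expected {expected_n}"
                )
    # equal_var=False selects Welch's test, matching R's default t.test(x, y).
    return stats.ttest_ind(a, b, equal_var=False, alternative=alternative)


res = checked_welch(x, y, expected_n=16)
res_less = checked_welch(x, y, expected_n=16, alternative="less")
print(res.statistic, res.pvalue)   # compare with R: t = -1.6229, p-value = 0.1152
print(res_less.pvalue)             # compare with R: p-value = 0.05758
```

With a stray extra value in either sample, the helper raises a `ValueError` instead of silently returning a different statistic.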

I wouldn't bank on Excel's robustness, however.

Russia Must Remove Putin answered Oct 21 '22 14:10