Tracking down the assumptions made by SciPy's `ttest_ind()` function

Tags:

I'm trying to write my own Python code to compute t-statistics and p-values for one and two tailed independent t tests. I can use the normal approximation, but for the moment I am trying to just use the t-distribution. I've been unsuccessful in matching the results of SciPy's stats library on my test data. I could use a fresh pair of eyes to see if I'm just making a dumb mistake somewhere.

Note, this is cross-posted from Cross-Validated because it's been up for a while over there with no responses, so I thought it can't hurt to also get some software developer opinions. I'm trying to understand if there's an error in the algorithm I'm using, which should reproduce SciPy's result. This is a simple algorithm, so it's puzzling why I can't locate the mistake.

My code:

import numpy as np
import scipy.stats as st

def compute_t_stat(pop1,pop2):

    num1 = pop1.shape[0]; num2 = pop2.shape[0];

    # The formula for t-stat when population variances differ.
    t_stat = (np.mean(pop1) - np.mean(pop2))/np.sqrt( np.var(pop1)/num1 + np.var(pop2)/num2 )

    # ADDED: The Welch-Satterthwaite degrees of freedom.
    df = ((np.var(pop1)/num1 + np.var(pop2)/num2)**(2.0))/(   (np.var(pop1)/num1)**(2.0)/(num1-1) +  (np.var(pop2)/num2)**(2.0)/(num2-1) ) 

    # Am I computing this wrong?
    # It should just come from the CDF like this, right?
    # The extra parameter is the degrees of freedom.

    one_tailed_p_value = 1.0 - st.t.cdf(t_stat,df)
    two_tailed_p_value = 1.0 - ( st.t.cdf(np.abs(t_stat),df) - st.t.cdf(-np.abs(t_stat),df) )    


    # Computing with SciPy's built-ins
    # My results don't match theirs.
    t_ind, p_ind = st.ttest_ind(pop1, pop2)

    return t_stat, one_tailed_p_value, two_tailed_p_value, t_ind, p_ind

Update:

After reading a bit more on the Welch's t-test, I saw that I should be using the Welch-Satterthwaite formula to calculate degrees of freedom. I updated the code above to reflect this.

With the new degrees of freedom, I get a closer result. My two-sided p-value is off by about 0.008 from the SciPy version's... but this is still much too big an error so I must still be doing something incorrect (or SciPy distribution functions are very bad, but it's hard to believe they are only accurate to 2 decimal places).

Second update:

While continuing to try things, I thought maybe SciPy's version automatically computes the Normal approximation to the t-distribution when the degrees of freedom are high enough (roughly > 30). So I re-ran my code using the Normal distribution instead, and the computed results are actually further away from SciPy's than when I use the t-distribution.

Bonus question :) (More statistical theory related; feel free to ignore)

Also, the t-statistic is negative. I was just wondering what this means for the one-sided t-test. Does this typically mean that I should be looking in the negative axis direction for the test? In my test data, population 1 is a control group who did not receive a certain employment training program. Population 2 did receive it, and the measured data are wage differences before/after treatment.

So I have some reason to think that the mean for population 2 will be larger. But from a statistical theory point of view, it doesn't seem right to concoct a test this way. How could I have known to check (for the one-sided test) in the negative direction without relying on subjective knowledge about the data? Or is this just one of those frequentist things that, while not philosophically rigorous, needs to be done in practice?

256

asked Apr 06 '12 02:04

ely

2 Answers

By using the SciPy built-in function source(), I could see a printout of the source code for the function ttest_ind(). Based on the source code, the SciPy built-in is performing the t-test assuming that the variances of the two samples are equal. It is not using the Welch-Satterthwaite degrees of freedom. SciPy assumes equal variances but does not state this assumption.

I just want to point out that, crucially, this is why you should not just trust library functions. In my case, I actually do need the t-test for populations of unequal variances, and the degrees of freedom adjustment might matter for some of the smaller data sets I will run this on.

As I mentioned in some comments, the discrepancy between my code and SciPy's is about 0.008 for sample sizes between 30 and 400, and then slowly goes to zero for larger sample sizes. This is an effect of the extra (1/n1 + 1/n2) term in the equal-variances t-statistic denominator. Accuracy-wise, this is pretty important, especially for small sample sizes. It definitely confirms to me that I need to write my own function. (Possibly there are other, better Python libraries, but this at least should be known. Frankly, it's surprising this isn't anywhere up front and center in the SciPy documentation for ttest_ind()).

answered Oct 11 '22 14:10

ely

You are not calculating the sample variance, but instead you are using population variances. Sample variance divides by n-1, instead of n. np.var has an optional argument called ddof for reasons similar to this.

This should give you your expected result:

import numpy as np
import scipy.stats as st

def compute_t_stat(pop1,pop2):

    num1 = pop1.shape[0]
    num2 = pop2.shape[0];
    var1 = np.var(pop1, ddof=1)
    var2 = np.var(pop2, ddof=1)

    # The formula for t-stat when population variances differ.
    t_stat = (np.mean(pop1) - np.mean(pop2)) / np.sqrt(var1/num1 + var2/num2)

    # ADDED: The Welch-Satterthwaite degrees of freedom.
    df = ((var1/num1 + var2/num2)**(2.0))/((var1/num1)**(2.0)/(num1-1) + (var2/num2)**(2.0)/(num2-1)) 

    # Am I computing this wrong?
    # It should just come from the CDF like this, right?
    # The extra parameter is the degrees of freedom.

    one_tailed_p_value = 1.0 - st.t.cdf(t_stat,df)
    two_tailed_p_value = 1.0 - ( st.t.cdf(np.abs(t_stat),df) - st.t.cdf(-np.abs(t_stat),df) )    


    # Computing with SciPy's built-ins
    # My results don't match theirs.
    t_ind, p_ind = st.ttest_ind(pop1, pop2)

    return t_stat, one_tailed_p_value, two_tailed_p_value, t_ind, p_ind

PS: SciPy is open source and mostly implemented with Python. You could have checked the source code for ttest_ind and find out your mistake yourself.

For the bonus side: You don't decide on the side of the one-tail test by looking at your t-value. You decide it beforehand with your hypothesis. If your null hypothesis is that the means are equal and your alternative hypothesis is that the second mean is larger, then your tail should be on the left (negative) side. Because sufficiently small (negative) values of your t-value would indicate that the alternative hypothesis is more likely to be true instead of the null hypothesis.

answered Oct 11 '22 13:10

Avaris

Related questions
                            
                                Global Interpreter Lock and access to data (eg. for NumPy arrays)
                            
                                ring buffer with numpy/ctypes
                            
                                Ordering query result by numeric strings in django (postgres backend)
                            
                                How to get output from gdb.execute in PythonGDB (GDB 7.1)?
                            
                                python and XML: how to place two documents into a single document
                            
                                django csrf_token not printing hidden input field
                            
                                Is there a way to configure a Python logging formatter via config file to log time as Unix timestamp?
                            
                                Matplotlib: Assign Colors to Lines
                            
                                subprocess.Popen and shlex.split formatting in windows and linux
                            
                                Another Simple Random Walk Simulation Using Python(Two-Dimensional)
                            
                                Python 2.X: Why Can't I Properly Handle Unicode?
                            
                                Double or float - optimization routines
                            
                                How to convert array of tamil unicode values into tamil string in python with whitespaces?
                            
                                Name of this algorithm, and is there a numpy/scipy implementation of it?
                            
                                Convert List of Numbers to String Ranges
                            
                                Which spam corpus I can use in NLTK?
                            
                                is there putAll like method for dict in python?
                            
                                matplotlib plot small image without resampling
                            
                                How do I access my database level functions inside other classes/files in Tornado?
                            
                                Creating cx_Freeze exe with Numpy for Python

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Tracking down the assumptions made by SciPy's `ttest_ind()` function

Tags:

python

algorithm

statistics

ely

People also ask

2 Answers

ely

Avaris

Recent Activity

Donate For Us