I have a problem with doing a t-test in scipy that's driving me slowly crazy. It should be simple to resolve, but nothing I do works and there's no solution I can find through extensive searching. I'm using Spyder on the latest distribution of Anaconda.
Specifically: I want to compare means between two columns, 'Trait_A' and 'Trait_B', in a pandas dataframe that I've imported from a csv file. Some of the values in one of the columns are 'NaN' ('Not a Number'). By default, the scipy independent-samples t-test function doesn't accommodate 'NaN' values. Setting the 'nan_policy' parameter to 'omit' should deal with this, but when I do, the test statistic and p-value still come back as 'NaN'. When I restrict the range of values to rows with actual numbers, the test works fine. My data and code are below; can anyone suggest what I'm doing wrong? Thanks!
Data:
Trait_A Trait_B
0 1.714286 0.000000
1 4.275862 4.000000
2 0.500000 4.625000
3 1.000000 0.000000
4 1.000000 4.000000
5 1.142857 1.000000
6 2.000000 1.000000
7 9.416667 1.956522
8 2.052632 0.571429
9 2.100000 0.166667
10 0.666667 0.000000
11 2.333333 1.705882
12 2.768145 NaN
13 0.000000 NaN
14 6.333333 NaN
15 0.928571 NaN
My code:
import pandas as pd
import scipy.stats as sp
data = pd.read_csv("filepath/Data2.csv")
# scipy.stats is already imported as sp, so ttest_ind is called on it directly
print(sp.ttest_ind(data['Trait_A'], data['Trait_B'], nan_policy='omit'))
My result:
Ttest_indResult(statistic=nan, pvalue=nan)
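It seems like a bug.
As a workaround, you can drop the NaNs before passing the columns to the t-test. Note that data.dropna() removes every row that contains a NaN, so the 'Trait_A' values in those rows are discarded as well: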
sp.ttest_ind(data.dropna()['Trait_A'], data.dropna()['Trait_B'])
Ttest_indResult(statistic=0.88752464718609214, pvalue=0.38439692093551037)
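If you would rather keep all 16 'Trait_A' values, which is what nan_policy='omit' is meant to do, one option is to drop the NaNs from each column separately. The following is a minimal sketch, not the only way to do it: the dataframe is rebuilt inline from the values posted in the question so that it runs without the csv file, and the variable names (listwise, per_column and so on) are just illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Values copied from the question; the last four Trait_B entries are missing.
data = pd.DataFrame({
    "Trait_A": [1.714286, 4.275862, 0.500000, 1.000000, 1.000000, 1.142857,
                2.000000, 9.416667, 2.052632, 2.100000, 0.666667, 2.333333,
                2.768145, 0.000000, 6.333333, 0.928571],
    "Trait_B": [0.000000, 4.000000, 4.625000, 0.000000, 4.000000, 1.000000,
                1.000000, 1.956522, 0.571429, 0.166667, 0.000000, 1.705882,
                np.nan, np.nan, np.nan, np.nan],
})

# Listwise deletion: every row with a NaN is dropped from BOTH columns.
complete_rows = data.dropna()
listwise = stats.ttest_ind(complete_rows["Trait_A"], complete_rows["Trait_B"])

# Per-column deletion: each column only loses its own NaNs.
trait_a = data["Trait_A"].dropna()
trait_b = data["Trait_B"].dropna()
per_column = stats.ttest_ind(trait_a, trait_b)

print("rows used (listwise):  ", len(complete_rows), "vs", len(complete_rows))
print("rows used (per column):", len(trait_a), "vs", len(trait_b))
print("listwise:  ", listwise)
print("per column:", per_column)
```

The listwise result should match the one quoted above, while the per-column result uses the four extra 'Trait_A' observations and will therefore differ.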
The bug is at line 3885 of scipy/scipy/stats/stats.py:
# check both a and b
contains_nan, nan_policy = (_contains_nan(a, nan_policy) or
_contains_nan(b, nan_policy))
It should be:
contains_nan = (_contains_nan(a, nan_policy)[0] or
_contains_nan(b, nan_policy)[0])
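_contains_nan returns a tuple, and a non-empty tuple is always truthy, so the 'or' stops at the first call and b is never checked. Your NaNs are all in 'Trait_B', the second argument, so they go unnoticed. Until this is fixed, swapping 'Trait_A' and 'Trait_B' in your call solves your problem, since the NaNs are then in a and get detected (the swap flips the sign of the t statistic but leaves the p-value unchanged).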
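The pitfall can be reproduced without touching SciPy's internals. This is only a sketch: contains_nan_stub is a hypothetical stand-in written for this example, assumed to return a (flag, policy) tuple in the same way the unpacking in the snippet above implies for the private _contains_nan helper.

```python
import numpy as np

def contains_nan_stub(values, nan_policy="propagate"):
    """Hypothetical stand-in for the private helper: returns (flag, policy)."""
    return bool(np.isnan(values).any()), nan_policy

a = np.array([1.0, 2.0, 3.0])        # no NaN, like Trait_A
b = np.array([1.0, np.nan, 3.0])     # has a NaN, like Trait_B

# Buggy pattern: the left operand is a non-empty tuple, which is always
# truthy, so `or` returns it immediately and b is never inspected.
buggy_flag, _ = contains_nan_stub(a, "omit") or contains_nan_stub(b, "omit")

# Fixed pattern: compare the boolean flags themselves.
fixed_flag = contains_nan_stub(a, "omit")[0] or contains_nan_stub(b, "omit")[0]

# Swapping the arguments hides the bug, because the NaNs are now in the
# first array, the only one the buggy expression looks at.
swapped_flag, _ = contains_nan_stub(b, "omit") or contains_nan_stub(a, "omit")

print(buggy_flag, fixed_flag, swapped_flag)  # False True True
```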