T-Test in Scipy with NaN values

I have a problem with doing a t-test in scipy that's driving me slowly crazy. It should be simple to resolve, but nothing I do works and there's no solution I can find through extensive searching. I'm using Spyder on the latest distribution of Anaconda.

Specifically: I want to compare means between two columns, 'Trait_A' and 'Trait_B', in a pandas dataframe that I've imported from a csv file. Some of the values in one of the columns are 'NaN' ('Not a Number'). By default, scipy's independent-samples t-test function doesn't accommodate 'NaN' values. However, setting the 'nan_policy' parameter to 'omit' should deal with this. Nevertheless, when I do, the test statistic and p-value come back as 'NaN'. When I restrict the range of values covered to actual numbers, the test works fine. My data and code are below; can anyone suggest what I'm doing wrong? Thanks!

Data:

     Trait_A   Trait_B
0   1.714286  0.000000
1   4.275862  4.000000
2   0.500000  4.625000
3   1.000000  0.000000
4   1.000000  4.000000
5   1.142857  1.000000
6   2.000000  1.000000
7   9.416667  1.956522
8   2.052632  0.571429
9   2.100000  0.166667
10  0.666667  0.000000
11  2.333333  1.705882
12  2.768145       NaN
13  0.000000       NaN
14  6.333333       NaN
15  0.928571       NaN

My code:

import pandas as pd
import scipy.stats as sp

data = pd.read_csv("filepath/Data2.csv")
# nan_policy='omit' is meant to ignore the NaN entries in 'Trait_B'
print(sp.ttest_ind(data['Trait_A'], data['Trait_B'], nan_policy='omit'))

My result:

Ttest_indResult(statistic=nan, pvalue=nan)
Lodore66 asked May 04 '16 08:05

2 Answers

This looks like a bug. As a workaround, you can drop the NaNs before passing the columns to the t-test:

sp.ttest_ind(data.dropna()['Trait_A'], data.dropna()['Trait_B'])
Ttest_indResult(statistic=0.88752464718609214, pvalue=0.38439692093551037)
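
Note that data.dropna() removes every row that has a NaN in any column, so the four valid 'Trait_A' values in rows 12-15 are discarded along with the missing 'Trait_B' values. An independent-samples test does not need equal-length samples, so one alternative is to drop the NaNs from each column separately. A minimal sketch, assuming the data from the question is rebuilt inline (the variable names are illustrative, not from the original post):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Data copied from the question; NaN marks the missing 'Trait_B' values.
data = pd.DataFrame({
    "Trait_A": [1.714286, 4.275862, 0.5, 1.0, 1.0, 1.142857, 2.0, 9.416667,
                2.052632, 2.1, 0.666667, 2.333333, 2.768145, 0.0, 6.333333,
                0.928571],
    "Trait_B": [0.0, 4.0, 4.625, 0.0, 4.0, 1.0, 1.0, 1.956522,
                0.571429, 0.166667, 0.0, 1.705882] + [np.nan] * 4,
})

# Row-wise drop: any row with a NaN is removed, leaving 12 rows per column.
paired = data.dropna()
rowwise = stats.ttest_ind(paired["Trait_A"], paired["Trait_B"])

# Column-wise drop: each sample keeps all of its own valid values.
trait_a = data["Trait_A"].dropna()   # 16 values
trait_b = data["Trait_B"].dropna()   # 12 values
columnwise = stats.ttest_ind(trait_a, trait_b)

print(len(paired), len(trait_a), len(trait_b))
print(rowwise)
print(columnwise)
```

The row-wise call corresponds to the workaround above; the column-wise call uses all 16 'Trait_A' values, so its statistic and p-value will differ.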
ayhan answered Sep 19 '22 08:09


The bug is at line 3885 of scipy/scipy/stats/stats.py:

# check both a and b
contains_nan, nan_policy = (_contains_nan(a, nan_policy) or
                            _contains_nan(b, nan_policy))

It should be:

contains_nan             = (_contains_nan(a, nan_policy)[0] or
                            _contains_nan(b, nan_policy)[0])

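_contains_nan returns a tuple, and a non-empty tuple is always truthy, so the 'or' always yields the result for a and never checks b. That is why swapping 'Trait_A' and 'Trait_B' solves the problem in your case.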
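
The effect of the 'or' can be reproduced without SciPy. In this sketch, contains_nan_stub is a hypothetical stand-in for the private _contains_nan helper. It is assumed only to return a (bool, policy) tuple, as the unpacking in the snippet above implies. It is not SciPy's actual implementation.

```python
import math

def contains_nan_stub(values, nan_policy):
    """Hypothetical stand-in for SciPy's private helper (an assumption)."""
    return any(math.isnan(v) for v in values), nan_policy

a = [1.0, 2.0, 3.0]            # no NaN, like 'Trait_A'
b = [1.0, float("nan"), 3.0]   # has NaN, like 'Trait_B'

# Buggy pattern: a non-empty tuple is always truthy, so `or` returns the
# tuple for `a` and the call for `b` is never evaluated.
buggy_flag, _ = contains_nan_stub(a, "omit") or contains_nan_stub(b, "omit")

# Fixed pattern: index the boolean out of each tuple before combining.
fixed_flag = contains_nan_stub(a, "omit")[0] or contains_nan_stub(b, "omit")[0]

# Swapping the arguments puts the NaN-containing sample first, so even the
# buggy pattern reports the NaN.
swapped_flag, _ = contains_nan_stub(b, "omit") or contains_nan_stub(a, "omit")

print(buggy_flag, fixed_flag, swapped_flag)  # False True True
```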

B. M. answered Sep 18 '22 08:09