
ANOVA for groups within a dataframe using scipy

I have the following DataFrame and need to run an ANOVA between its three conditions:

import pandas as pd

data0 = pd.DataFrame({'Names': ['CTA15', 'CTA15', 'AC007', 'AC007', 'AC007', 'AC007'],
    'value': [22, 22, 2, 2, 2, 5],
    'condition': ['NON', 'NON', 'YES', 'YES', 'RE', 'RE']})

I need to run the ANOVA test between each pair of conditions (YES vs. NON, NON vs. RE, and YES vs. RE) for each value in Names. I know I could do it like this:

NON = data0.query('condition == "NON" and Names == "CTA15"')
no = NON.value
YES = data0.query('condition == "YES" and Names == "CTA15"')
Y = YES.value

Then perform a one-way ANOVA as follows:

from scipy import stats
f_val, p_val = stats.f_oneway(no, Y)
print("One-way ANOVA P =", p_val)

It would be great if there were a more elegant solution, since my actual DataFrame is large and has many names and conditions to compare.

user1017373 asked May 19 '17 08:05



1 Answer

Consider the following sample DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Names': np.random.randint(1, 10, 1000), 
                   'value': np.random.randn(1000), 
                   'condition': np.random.choice(['NON', 'YES', 'RE'], 1000)})

df.head()
Out: 
   Names condition     value
0      4        RE  0.844120
1      4       NON -0.440285
2      5       YES  0.559497
3      4        RE  0.472425
4      9       YES  0.205906

The following groups the DataFrame by Names, and then passes each condition group to ANOVA:

import scipy.stats as ss

for name, name_df in df.groupby('Names'):
    # One Series of values per condition within this name
    samples = [values for _, values in name_df.groupby('condition')['value']]
    f_val, p_val = ss.f_oneway(*samples)
    print('Name: {}, F value: {:.3f}, p value: {:.3f}'.format(name, f_val, p_val))

Name: 1, F value: 0.138, p value: 0.871
Name: 2, F value: 1.458, p value: 0.237
Name: 3, F value: 0.742, p value: 0.479
Name: 4, F value: 2.718, p value: 0.071
Name: 5, F value: 0.255, p value: 0.776
Name: 6, F value: 1.731, p value: 0.182
Name: 7, F value: 0.269, p value: 0.764
Name: 8, F value: 0.474, p value: 0.624
Name: 9, F value: 1.226, p value: 0.297
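
The loop above runs one omnibus test per name across all three conditions. If you specifically want each pair of conditions compared (YES vs. NON, NON vs. RE, YES vs. RE), a minimal sketch along the same lines is shown below. The helper name pairwise_anova, the output column names, and the small demo frame are made up for illustration and are not part of scipy or pandas:

```python
from itertools import combinations

import pandas as pd
from scipy import stats


def pairwise_anova(df, group_col="Names", cond_col="condition", value_col="value"):
    """Run a one-way ANOVA for every pair of conditions within each group.

    Hypothetical helper: returns one row per (group, condition pair).
    """
    rows = []
    for name, name_df in df.groupby(group_col):
        # Map each condition present for this name to its Series of values.
        samples = {cond: vals for cond, vals in name_df.groupby(cond_col)[value_col]}
        # Every unordered pair of conditions, in alphabetical order.
        for a, b in combinations(sorted(samples), 2):
            f_val, p_val = stats.f_oneway(samples[a], samples[b])
            rows.append({
                group_col: name,
                "cond_a": a,
                "cond_b": b,
                "n_a": len(samples[a]),
                "n_b": len(samples[b]),
                "F": f_val,
                "p": p_val,
            })
    columns = [group_col, "cond_a", "cond_b", "n_a", "n_b", "F", "p"]
    return pd.DataFrame(rows, columns=columns)


# Illustrative data only: two names, three conditions, two values each.
demo = pd.DataFrame({
    "Names": ["CTA15"] * 6 + ["AC007"] * 6,
    "condition": ["NON", "NON", "YES", "YES", "RE", "RE"] * 2,
    "value": [22, 24, 2, 3, 5, 7, 10, 12, 11, 13, 30, 33],
})

result = pairwise_anova(demo)
print(result)
```

With only two groups, a one-way ANOVA is equivalent to an equal-variance two-sample t-test (F equals t squared), so this gives the same p values as a pairwise t-test. Names that have only one condition simply produce no rows. Keep in mind that these p values are not corrected for multiple comparisons.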

For post-hoc tests, you can use pairwise_tukeyhsd from statsmodels:

from statsmodels.stats.multicomp import pairwise_tukeyhsd
for name, grouped_df in df.groupby('Names'):
    print('Name {}'.format(name), pairwise_tukeyhsd(grouped_df['value'], grouped_df['condition']))
Name 1 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE    0.0086  -0.5129 0.5301 False 
 NON    YES    0.0084  -0.4817 0.4986 False 
  RE    YES   -0.0002  -0.5217 0.5214 False 
--------------------------------------------
Name 2 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE   -0.0089  -0.5299 0.5121 False 
 NON    YES    0.083   -0.4182 0.5842 False 
  RE    YES    0.0919  -0.4008 0.5846 False 
--------------------------------------------
Name 3 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE    0.2401  -0.3136 0.7938 False 
 NON    YES    0.2765  -0.2903 0.8432 False 
  RE    YES    0.0364  -0.5052 0.578  False 
--------------------------------------------
Name 4 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE    0.0894  -0.5825 0.7613 False 
 NON    YES   -0.0437  -0.7418 0.6544 False 
  RE    YES   -0.1331  -0.6949 0.4287 False 
--------------------------------------------
Name 5 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE   -0.4264  -0.9495 0.0967 False 
 NON    YES    0.0439  -0.4264 0.5142 False 
  RE    YES    0.4703  -0.0155 0.9561 False 
--------------------------------------------
Name 6 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE    0.0649  -0.4971 0.627  False 
 NON    YES    -0.406  -0.9405 0.1285 False 
  RE    YES   -0.4709  -1.0136 0.0717 False 
--------------------------------------------
Name 7 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE    0.3111  -0.2766 0.8988 False 
 NON    YES   -0.1664  -0.7314 0.3987 False 
  RE    YES   -0.4774  -1.0688 0.114  False 
--------------------------------------------
Name 8 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE   -0.0224   -0.668 0.6233 False 
 NON    YES    0.0119   -0.668 0.6918 False 
  RE    YES    0.0343  -0.6057 0.6742 False 
--------------------------------------------
Name 9 Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
 NON     RE   -0.2414  -0.7792 0.2963 False 
 NON    YES    0.0696  -0.5746 0.7138 False 
  RE    YES    0.311   -0.3129 0.935  False 
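
With many names, the printed tables get hard to scan. One possible way to gather the Tukey results into a single DataFrame is sketched below. It relies on the groupsunique, meandiffs, confint and reject attributes of the object returned by pairwise_tukeyhsd, and assumes the pairs are ordered like itertools.combinations over the sorted group labels. The helper name tukey_table and the demo data are hypothetical:

```python
from itertools import combinations

import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd


def tukey_table(df, group_col="Names", cond_col="condition", value_col="value", alpha=0.05):
    """Collect per-name Tukey HSD comparisons into one tidy DataFrame.

    Hypothetical helper built on statsmodels' pairwise_tukeyhsd.
    """
    frames = []
    for name, name_df in df.groupby(group_col):
        res = pairwise_tukeyhsd(name_df[value_col], name_df[cond_col], alpha=alpha)
        # Pairs follow the upper-triangle order of the sorted unique groups.
        pairs = list(combinations(res.groupsunique, 2))
        frames.append(pd.DataFrame({
            group_col: name,
            "group1": [a for a, _ in pairs],
            "group2": [b for _, b in pairs],
            "meandiff": res.meandiffs,      # mean(group2) - mean(group1)
            "lower": res.confint[:, 0],     # lower bound of the interval
            "upper": res.confint[:, 1],     # upper bound of the interval
            "reject": res.reject,           # True if the difference is significant
        }))
    return pd.concat(frames, ignore_index=True)


# Illustrative data only: 3 names x 3 conditions x 10 random values.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Names": np.repeat([1, 2, 3], 30),
    "condition": np.tile(np.repeat(["NON", "YES", "RE"], 10), 3),
    "value": rng.normal(size=90),
})

table = tukey_table(demo)
print(table)
```

From there, something like table[table['reject']] lists only the name and condition pairs whose difference is significant at the chosen alpha.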
ayhan answered Nov 04 '22 05:11
