How to get the p-value between two groups after groupby in pandas?

I am stuck on how to apply a custom function that calculates the p-value between two groups obtained from a pandas groupby.

vocabulary

test = 0 ==> test
test = 1 ==> control

problem setup

import numpy as np
import pandas as pd
import scipy.stats as stats

np.random.seed(100)
N = 15
df = pd.DataFrame({'country': np.random.choice(['A','B','C'],N),
                   'test': np.random.choice([0,1], N),
                   'conversion': np.random.choice([0,1], N),
                   'sex': np.random.choice(['M','F'], N)
                  })


ans = df.groupby(['country','test'])['conversion'].agg(['size','mean']).unstack('test')
ans.columns = ['test_size','control_size','test_mean','control_mean']
print(ans)

         test_size  control_size  test_mean  control_mean
country                                                  
A                3             3   0.666667      0.666667
B                1             1   1.000000      1.000000
C                4             3   0.750000      1.000000

Question

Now I want to add a pvalue column holding the p-value between the test and control groups. But agg operates on one series at a time, and I am not sure how to pass it two series to compute the p-value.

Done so far:

def get_ttest(x,y):
    return stats.ttest_ind(x, y, equal_var=False).pvalue

pseudo code:

df.groupby(['country','test'])['conversion'].agg(
['size','mean', some_function_to_get_pvalue])

How do I get the p-value column?

Required Answer

I need to get the values for the pvalue column:

         test_size  control_size  test_mean  control_mean  pvalue
country                                                  
A                3             3   0.666667      0.666667   ?
B                1             1   1.000000      1.000000   ?
C                4             3   0.750000      1.000000   ?
BhishanPoudel asked Dec 26 '19 16:12
1 Answer

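Compute the size and mean with agg as before, then compute the p-value per country with groupby(...).apply. The function passed to apply receives the whole sub-DataFrame for each country, so it can select both groups and compare them: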

import numpy as np
import pandas as pd
import scipy.stats as stats

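def get_ttest(x,y,sided=1):
    # Welch's t-test (unequal variances). sided=1 returns the two-sided p-value;
    # sided=2 halves it, which is a one-sided p-value only when the observed
    # difference lies in the hypothesized direction.
    return stats.ttest_ind(x, y, equal_var=False).pvalue/sided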

np.random.seed(100)
N = 15
df = pd.DataFrame({'country': np.random.choice(['A','B','C'],N),
                   'test': np.random.choice([0,1], N),
                   'conversion': np.random.choice([0,1], N),
                   'sex': np.random.choice(['M','F'], N)
                  })


col_groupby = 'country'
col_test_control = 'test'
col_effect = 'conversion'

# unstack() orders the columns by sorted group label, so sort a and b to match
a,b = sorted(df[col_test_control].unique())

df_pval = df.groupby([col_groupby,col_test_control])\
            [col_effect].agg(['size','mean']).unstack(col_test_control)

df_pval.columns = [f'group{a}_size',f'group{b}_size',
                   f'group{a}_mean',f'group{b}_mean']

df_pval['pvalue'] = df.groupby(col_groupby).apply(lambda dfx: get_ttest(
    dfx.loc[dfx[col_test_control] == a, col_effect],
    dfx.loc[dfx[col_test_control] == b, col_effect]))


df_pval.pipe(print)

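Result (shown with the question's column names; the code above names them group{a}_size, group{b}_size, and so on)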

         test_size  control_size  test_mean  control_mean    pvalue
country                                                            
A                3             3   0.666667      0.666667  1.000000
B                1             1   1.000000      1.000000       NaN
C                4             3   0.750000      1.000000  0.391002
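
The NaN for country B is expected rather than a bug: each group there holds a single observation, so the sample variance that Welch's t-test relies on is undefined. A minimal sketch of that behaviour, using made-up one-element samples rather than the question's data (the name p_single is only illustrative):

```python
import warnings

import numpy as np
from scipy import stats

# One observation per group: the sample variance (ddof=1) is undefined,
# so the test statistic and the p-value cannot be computed.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the small-sample warnings
    p_single = stats.ttest_ind([1], [1], equal_var=False).pvalue

print(np.isnan(p_single))
```

If NaN is not acceptable downstream, one option is to check the group sizes before calling the test and decide explicitly what to report for such countries.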

Test the result

# test for country C
c0 = df.loc[(df.country=='C') & (df.test==0),'conversion']
c1 = df.loc[(df.country=='C') & (df.test==1),'conversion']

pval = stats.ttest_ind(c0, c1, equal_var=False).pvalue
print(pval) # 0.39100221895577053
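
As an alternative sketch, the sizes, means and p-value can all be computed in one pass by iterating over the groups and building one row per country. The helper name summarize, its parameters and the small demo frame below are assumptions made for illustration; they are not part of the original answer or of the question's random sample.

```python
import numpy as np
import pandas as pd
from scipy import stats

def summarize(group, flag_col="test", value_col="conversion"):
    """Return sizes, means and the Welch p-value for one country (hypothetical helper)."""
    x = group.loc[group[flag_col] == 0, value_col]  # test == 0 is the test group
    y = group.loc[group[flag_col] == 1, value_col]  # test == 1 is the control group
    # Welch's t-test needs at least two observations in each group
    if len(x) > 1 and len(y) > 1:
        pvalue = stats.ttest_ind(x, y, equal_var=False).pvalue
    else:
        pvalue = np.nan
    return pd.Series({
        "test_size": len(x), "control_size": len(y),
        "test_mean": x.mean(), "control_mean": y.mean(),
        "pvalue": pvalue,
    })

# Small hand-made frame (illustrative data only)
demo = pd.DataFrame({
    "country": ["A"] * 6 + ["B"] * 2,
    "test": [0, 0, 0, 1, 1, 1, 0, 1],
    "conversion": [1, 0, 1, 0, 0, 1, 1, 1],
})

# One Series per country, assembled into a frame with countries as rows
out = pd.DataFrame({c: summarize(g) for c, g in demo.groupby("country")}).T
out.index.name = "country"
print(out)
```

Building the rows explicitly keeps the column labels tied to the values they describe, so there is no need to rename the columns after unstack.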
BhishanPoudel answered Sep 22 '22 09:09