How to get the p-value between two groups after groupby in pandas?

I am stuck on how to apply a custom function that calculates the p-value between two groups obtained from a pandas groupby.


In the test column:
test = 0 ==> test group
test = 1 ==> control group

Problem setup:

import numpy as np
import pandas as pd
import scipy.stats as stats

N = 15
df = pd.DataFrame({'country': np.random.choice(['A','B','C'],N),
                   'test': np.random.choice([0,1], N),
                   'conversion': np.random.choice([0,1], N),
                   'sex': np.random.choice(['M','F'], N)})


ans = df.groupby(['country','test'])['conversion'].agg(['size','mean']).unstack('test')
ans.columns = ['test_size','control_size','test_mean','control_mean']
         test_size  control_size  test_mean  control_mean
A                3             3   0.666667      0.666667
B                1             1   1.000000      1.000000
C                4             3   0.750000      1.000000


Now I want to add a column with the p-value between the test and control groups. But in my groupby aggregation I can only operate on one series at a time, and I am not sure how to use two series to compute the p-value.

Done so far:

def get_ttest(x,y):
    return stats.ttest_ind(x, y, equal_var=False).pvalue

pseudo code:

df.groupby(['country','test'])['conversion'].agg(['size','mean', some_function_to_get_pvalue])

How do I get the p-value column?

Required Answer

I need to get the values for the pvalue column:

         test_size  control_size  test_mean  control_mean  pvalue
A                3             3   0.666667      0.666667   ?
B                1             1   1.000000      1.000000   ?
C                4             3   0.750000      1.000000   ?
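You can compute the p-value with a second groupby on the country alone, using apply so that each call sees both the test and control rows: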

import numpy as np
import pandas as pd
import scipy.stats as stats

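def get_ttest(x,y,sided=1):
    # sided=1 returns the two-sided p-value from Welch's t-test unchanged;
    # sided=2 halves it, which is a one-sided p-value only when the observed
    # difference lies in the hypothesized direction.
    return stats.ttest_ind(x, y, equal_var=False).pvalue/sided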

N = 15
df = pd.DataFrame({'country': np.random.choice(['A','B','C'],N),
                   'test': np.random.choice([0,1], N),
                   'conversion': np.random.choice([0,1], N),
                   'sex': np.random.choice(['M','F'], N)})


col_groupby = 'country'
col_test_control = 'test'
col_effect = 'conversion'

# sort so that a, b match the column order produced by unstack (0 before 1)
a,b = sorted(df[col_test_control].unique())

df_pval = df.groupby([col_groupby,col_test_control])[col_effect]\
    .agg(['size','mean']).unstack(col_test_control)

df_pval.columns = [f'group{a}_size',f'group{b}_size',
                   f'group{a}_mean',f'group{b}_mean']

df_pval['pvalue'] = df.groupby(col_groupby).apply(lambda dfx: get_ttest(
    dfx.loc[dfx[col_test_control] == a, col_effect],
    dfx.loc[dfx[col_test_control] == b, col_effect]))



Output (shown with the column names used in the question):

         test_size  control_size  test_mean  control_mean    pvalue
A                3             3   0.666667      0.666667  1.000000
B                1             1   1.000000      1.000000       NaN
C                4             3   0.750000      1.000000  0.391002
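
The NaN for country B is expected: each of its groups holds a single observation, so the sample variance that Welch's t-test needs is undefined. A minimal sketch, assuming you would rather return NaN explicitly for such groups than rely on SciPy's behaviour for tiny samples. The names safe_ttest and min_n are hypothetical and not part of the code above:

```python
import numpy as np
import scipy.stats as stats

def safe_ttest(x, y, min_n=2):
    """Welch's t-test p-value, or NaN when either group has fewer than min_n values.

    safe_ttest and min_n are illustrative names, not an existing API.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if len(x) < min_n or len(y) < min_n:
        return np.nan  # variance is undefined for a single observation
    return stats.ttest_ind(x, y, equal_var=False).pvalue

# Country B: one observation per group
print(safe_ttest([1], [1]))

# Country C: 4 test rows with mean 0.75, 3 control rows with mean 1.0
print(safe_ttest([1, 1, 1, 0], [1, 1, 1]))
```

Swapping get_ttest for a guard like this inside the lambda keeps the pvalue column unchanged for groups with enough data.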

Test the result

# test for country C
c0 = df.loc[(df.country=='C') & (df.test==0),'conversion']
c1 = df.loc[(df.country=='C') & (df.test==1),'conversion']

pval = stats.ttest_ind(c0, c1, equal_var=False).pvalue
print(pval) # 0.39100221895577053
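
The same spot check can be repeated for every country rather than only C. A rough sketch, assuming a seeded generator and a larger N than in the question so that no group is too small for the test. The seed, the N value, and the helper name welch_pvalue are illustrative choices, not part of the original code:

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(0)   # assumed seed, only for reproducibility
N = 300                          # assumed size, larger than the question's 15
df = pd.DataFrame({'country': rng.choice(['A', 'B', 'C'], N),
                   'test': rng.choice([0, 1], N),
                   'conversion': rng.choice([0, 1], N)})

def welch_pvalue(x, y):
    # two-sided p-value from Welch's t-test (unequal variances)
    return stats.ttest_ind(x, y, equal_var=False).pvalue

# Per-country p-values from an explicit loop over the groups
pvalues = pd.Series(
    {country: welch_pvalue(g.loc[g['test'] == 0, 'conversion'],
                           g.loc[g['test'] == 1, 'conversion'])
     for country, g in df.groupby('country')},
    name='pvalue')

# Recompute each p-value directly with boolean masks and compare
for country, pval in pvalues.items():
    c0 = df.loc[(df.country == country) & (df.test == 0), 'conversion']
    c1 = df.loc[(df.country == country) & (df.test == 1), 'conversion']
    assert np.isclose(pval, welch_pvalue(c0, c1))

print(pvalues)
```

Iterating over the groups explicitly also avoids depending on how groupby.apply treats the grouping column, which differs between pandas versions.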
