I am stuck on how to apply the custom function to calculate the p-value for two groups obtained from pandas groupby.
test = 0 ==> test
test = 1 ==> control
import numpy as np
import pandas as pd
import scipy.stats as ss
np.random.seed(100)
N = 15
df = pd.DataFrame({'country': np.random.choice(['A','B','C'], N),
                   'test': np.random.choice([0,1], N),
                   'conversion': np.random.choice([0,1], N),
                   'sex': np.random.choice(['M','F'], N)
                   })
ans = df.groupby(['country','test'])['conversion'].agg(['size','mean']).unstack('test')
ans.columns = ['test_size','control_size','test_mean','control_mean']
         test_size  control_size  test_mean  control_mean
country
A                3             3   0.666667      0.666667
B                1             1   1.000000      1.000000
C                4             3   0.750000      1.000000
Now I want to add two more columns to get the p-value between test and control group. But in my groupby I can only operate on one series at a time and I am not sure how to use two series to get the p-value.
Done so far:
def get_ttest(x,y):
    # scipy.stats was imported as ss above
    return ss.ttest_ind(x, y, equal_var=False).pvalue
pseudo code:
df.groupby(['country','test'])['conversion'].agg(
['size','mean', some_function_to_get_pvalue])
How do I get the p-value column?
I need to fill in the values for the column pvalue:
         test_size  control_size  test_mean  control_mean  pvalue
country
A                3             3   0.666667      0.666667       ?
B                1             1   1.000000      1.000000       ?
C                4             3   0.750000      1.000000       ?
Calling groupby() does not return a dict; it returns a GroupBy object. Its groups attribute is a dict that maps each group key to the row labels in that group, so you can list the group keys with df.groupby('country').groups.keys().
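To make the idea of group keys concrete, here is a minimal sketch on a made-up frame (the data and the names toy, grouped, group_keys and group_sizes are assumptions for illustration, not part of the question). It lists the keys through the groups attribute and then iterates over the (key, sub-frame) pairs:

```python
import pandas as pd

# Hypothetical toy data, only for illustrating group inspection.
toy = pd.DataFrame({
    'country': ['A', 'B', 'A', 'C', 'B'],
    'conversion': [1, 0, 0, 1, 1],
})

grouped = toy.groupby('country')

# .groups maps each group key to the row labels belonging to that group.
group_keys = sorted(grouped.groups.keys())
print(group_keys)

# Iterating over a GroupBy yields (key, sub-DataFrame) pairs.
group_sizes = {key: len(sub) for key, sub in grouped}
print(group_sizes)
```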
You can do this:
import numpy as np
import pandas as pd
import scipy.stats as stats
def get_ttest(x,y,sided=1):
    # Welch's t-test (unequal variances); sided=1 keeps the two-sided p-value as is
    return stats.ttest_ind(x, y, equal_var=False).pvalue/sided
np.random.seed(100)
N = 15
df = pd.DataFrame({'country': np.random.choice(['A','B','C'], N),
                   'test': np.random.choice([0,1], N),
                   'conversion': np.random.choice([0,1], N),
                   'sex': np.random.choice(['M','F'], N)
                   })
col_groupby = 'country'
col_test_control = 'test'
col_effect = 'conversion'
# sort so that a, b follow the same order as the columns produced by unstack()
a,b = sorted(df[col_test_control].unique())
df_pval = df.groupby([col_groupby,col_test_control])\
    [col_effect].agg(['size','mean']).unstack(col_test_control)
df_pval.columns = [f'group{a}_size',f'group{b}_size',
                   f'group{a}_mean',f'group{b}_mean']
df_pval['pvalue'] = df.groupby(col_groupby).apply(lambda dfx: get_ttest(
    dfx.loc[dfx[col_test_control] == a, col_effect],
    dfx.loc[dfx[col_test_control] == b, col_effect]))
df_pval.pipe(print)
         test_size  control_size  test_mean  control_mean    pvalue
country
A                3             3   0.666667      0.666667  1.000000
B                1             1   1.000000      1.000000       NaN
C                4             3   0.750000      1.000000  0.391002
# test for country C
c0 = df.loc[(df.country=='C') & (df.test==0),'conversion']
c1 = df.loc[(df.country=='C') & (df.test==1),'conversion']
pval = stats.ttest_ind(c0, c1, equal_var=False).pvalue
print(pval) # 0.39100221895577053
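As a possible variation on the approach above, the sketch below loops over the country groups explicitly and returns NaN whenever either arm has fewer than two observations, which is the situation country B is in. The helper name welch_pvalue and the hand-written demo frame are assumptions made for illustration; they are not from the question's random data.

```python
import numpy as np
import pandas as pd
from scipy import stats

def welch_pvalue(sub, group_col='test', value_col='conversion'):
    """Two-sided Welch t-test p-value between rows with group_col == 0 and == 1."""
    x = sub.loc[sub[group_col] == 0, value_col]
    y = sub.loc[sub[group_col] == 1, value_col]
    # With fewer than two observations in either arm the sample variance is
    # undefined, so return NaN explicitly.
    if len(x) < 2 or len(y) < 2:
        return np.nan
    return stats.ttest_ind(x, y, equal_var=False).pvalue

# Hypothetical data, written by hand so each case is easy to reason about.
demo = pd.DataFrame({
    'country':    ['A'] * 6 + ['B'] * 2 + ['C'] * 6,
    'test':       [0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1],
    'conversion': [1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
})

# One p-value per country, keyed by the group name.
pvalues = pd.Series(
    {country: welch_pvalue(sub) for country, sub in demo.groupby('country')},
    name='pvalue',
)
pvalues.index.name = 'country'
print(pvalues)
```

Because the result is a Series indexed by country, it can be assigned straight into a summary table that uses the same index, for example df_pval['pvalue'] = pvalues.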