I have 3 dataframes, each containing 7 columns:
df_a
df_b
df_c
df_a.head()
    VSPD1_perc  VSPD2_perc  VSPD3_perc  VSPD4_perc  VSPD5_perc  VSPD6_perc  \
0          NaN         NaN         NaN         NaN         NaN         NaN
3     0.189588    0.228052    0.268460    0.304063    0.009837           0
5     0.134684    0.242556    0.449054    0.168816    0.004890           0
9     0.174806    0.232150    0.381936    0.211108    0.000000           0
11         NaN         NaN         NaN         NaN         NaN         NaN

    VSPD7_perc
0          NaN
3            0
5            0
9            0
11         NaN
My goal is to produce a matrix or a dataframe of the p-values from a t-test, testing dataframes df_b and df_c against df_a column for column. That is, column 1 in df_b and column 1 in df_c should each be tested against column 1 in df_a, and so on, with df_a serving as the standard to test against. I have found the statistical test in statsmodels (stat.ttest_ind(x1, x2)), but I need help building a matrix out of the p-values from the test. Does anyone know how to do this?
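Leaving aside proper NaN management, you can do it as simply as t, p = scipy.stats.ttest_ind(df_a.dropna(axis=0), df_b.dropna(axis=0)). With two-dimensional inputs the test runs along axis 0 by default, so p holds one p-value per column.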
See demo:
>>> import pandas as pd
>>> import scipy.stats
>>> import numpy as np
>>> df_a = pd.read_clipboard()
>>> df_b = df_a + np.random.randn(5, 7)
>>> df_c = df_a + np.random.randn(5, 7)
>>> _, p_b = scipy.stats.ttest_ind(df_a.dropna(axis=0), df_b.dropna(axis=0))
>>> _, p_c = scipy.stats.ttest_ind(df_a.dropna(axis=0), df_c.dropna(axis=0))
>>> pd.DataFrame([p_b, p_c], columns = df_a.columns, index = ['df_b', 'df_c'])
      VSPD1_perc  VSPD2_perc  VSPD3_perc  VSPD4_perc  VSPD5_perc  VSPD6_perc  \
df_b    0.425286    0.987956    0.644236    0.552244    0.432640    0.624528
df_c    0.947182    0.911384    0.189283    0.828780    0.697709    0.166956

      VSPD7_perc
df_b    0.546648
df_c    0.206950
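If you do want to handle NaNs more carefully, one possible approach is to drop them per column instead of per row, so that a missing value in one column does not discard the whole row for the others. The following is only a sketch: the helper name ttest_against_reference and the random demo data are made up for illustration, and it assumes all dataframes share the same column names.

```python
import numpy as np
import pandas as pd
from scipy import stats


def ttest_against_reference(reference, others):
    """Build a DataFrame of p-values: one row per compared frame,
    one column per column of the reference frame.

    `reference` is a DataFrame; `others` maps a label to a DataFrame
    with the same column names (hypothetical helper, not a library API).
    """
    rows = {}
    for name, frame in others.items():
        pvalues = {}
        for col in reference.columns:
            # Drop NaNs separately in each column before testing.
            x = reference[col].dropna()
            y = frame[col].dropna()
            _, p = stats.ttest_ind(x, y)
            pvalues[col] = p
        rows[name] = pvalues
    # Outer keys become columns, so transpose to get one row per frame,
    # then restore the reference column order.
    return pd.DataFrame(rows).T.reindex(columns=reference.columns)


# Made-up demo data standing in for df_a, df_b and df_c.
rng = np.random.default_rng(0)
cols = [f"VSPD{i}_perc" for i in range(1, 8)]
df_a = pd.DataFrame(rng.random((20, 7)), columns=cols)
df_b = pd.DataFrame(rng.random((20, 7)), columns=cols)
df_c = pd.DataFrame(rng.random((20, 7)) + 0.5, columns=cols)
df_a.iloc[0, 0] = np.nan  # a single missing value in one column

p_matrix = ttest_against_reference(df_a, {"df_b": df_b, "df_c": df_c})
print(p_matrix)
```

Looping over columns is slower than the single vectorized call above, but with only a handful of columns that rarely matters, and each test uses every non-missing value available in its own column.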