One of the main tools in my workflow is do.call(rbind, lapply()),
as in this R example:
df1 <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
df2 <- data.frame(x1 = rnorm(10, 5), x2 = rnorm(10), x3 = rnorm(10))
getp <- function(var) {
  return(t.test(df1[, var], df2[, var])$p.value)
}
list <- c('x1', 'x2', 'x3')
ps <- do.call(rbind, lapply(list, getp))
ps
[,1]
[1,] 6.232025e-09
[2,] 2.128019e-09
[3,] 5.824713e-08
This creates a nice column of p-values. In the real world I would return a one-row data.frame in which each column holds a useful model statistic. The goal is to iterate over many columns with the same model type and compare the fits and effects.
In Python, I can create a similar function:
from statsmodels.stats.weightstats import ttest_ind
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'x1' : np.random.randn(10), 'x2' : np.random.randn(10), 'x3' : np.random.randn(10)})
df2 = pd.DataFrame({'x1' : np.random.randn(10)+5, 'x2' : np.random.randn(10)+5, 'x3' : np.random.randn(10)+5})
def getp(var):
    print(ttest_ind(df1[var], df2[var])[1])
vars = ['x1', 'x2', 'x3']
I can print all p-values to the console via:
for i in vars:
    getp(i)
9.67944232638e-08
1.82163637251e-08
2.00410346438e-10
But I'd like to save the results as a single object with one column and three rows, as in the R version. Is this possible?
Thanks!
The actual function may look something like this:
def getMoreThanP(var):
    out = pd.DataFrame({'mean1' : [np.mean(df1[var])], 'mean2' : [np.mean(df2[var])], 'pvalue' : [ttest_ind(df1[var], df2[var])[1]]})
    print(out)
for i in vars:
    getMoreThanP(i)
mean1 mean2 pvalue
0 0.24452 4.824327 2.438985e-11
mean1 mean2 pvalue
0 0.187176 4.969862 1.115546e-11
mean1 mean2 pvalue
0 0.035759 5.249378 1.525264e-08
Instead of passing variables one by one, you can pass all three:
ttest_ind(df1[vars], df2[vars])[1]
Out[85]: array([ 4.97835813e-11, 8.30544748e-08, 9.24917262e-07])
The returned object is a one-dimensional array. If you want a DataFrame instead, wrap it:
pd.DataFrame(ttest_ind(df1[vars], df2[vars])[1])
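To keep the variable names attached to the p-values, the array can be given a labelled index. The following is only a minimal sketch: it assumes SciPy's `scipy.stats.ttest_ind` as a stand-in for the statsmodels function used above, and the seed, the `cols` name and the `pvalue` column name are illustrative choices, not part of the original code.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind  # assumption: SciPy's t-test stands in for statsmodels'

rng = np.random.default_rng(0)  # fixed seed, for illustration only
cols = ['x1', 'x2', 'x3']
df1 = pd.DataFrame(rng.standard_normal((10, 3)), columns=cols)
df2 = pd.DataFrame(rng.standard_normal((10, 3)) + 5, columns=cols)

# A single call tests every column (axis=0 is SciPy's default).
result = ttest_ind(df1[cols].to_numpy(), df2[cols].to_numpy())

# One column, three rows, indexed by variable name.
ps = pd.DataFrame({'pvalue': result.pvalue}, index=cols)
print(ps)
```

Each column is tested independently within that one call, so no explicit loop is needed.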
This is mainly because ttest_ind accepts array-like objects. For getMoreThanP, you can use a combination of pd.concat and map:
def getMoreThanP(var):
    out = pd.DataFrame({'mean1' : [np.mean(df1[var])], 'mean2' : [np.mean(df2[var])], 'pvalue' : [ttest_ind(df1[var], df2[var])[1]]})
    return out
pd.concat(map(getMoreThanP, vars))
# pd.concat(map(getMoreThanP, vars), ignore_index=True) if you want to reset index
Out[134]:
mean1 mean2 pvalue
0 -0.021791 4.964985 4.978358e-11
0 0.087019 4.610332 8.305447e-08
0 -0.084168 4.680124 9.249173e-07
Note that I changed the definition of getMoreThanP to return the dataframe instead of printing it.
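As a variation on the same idea, the function could return a plain dict per variable and the rows could be assembled in one step, which also lets the index carry the variable names. This is a sketch under assumptions rather than the method above: `summarize` is a hypothetical helper, SciPy's `ttest_ind` is swapped in for the statsmodels one, and the data are regenerated with an arbitrary seed.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind  # assumption: stand-in for statsmodels' ttest_ind

rng = np.random.default_rng(1)  # arbitrary seed, for illustration only
cols = ['x1', 'x2', 'x3']
df1 = pd.DataFrame(rng.standard_normal((10, 3)), columns=cols)
df2 = pd.DataFrame(rng.standard_normal((10, 3)) + 5, columns=cols)

def summarize(var):
    """Return the stats for one variable as a dict (hypothetical helper)."""
    return {
        'variable': var,
        'mean1': df1[var].mean(),
        'mean2': df2[var].mean(),
        'pvalue': ttest_ind(df1[var], df2[var]).pvalue,
    }

# Rough analogue of do.call(rbind, lapply(...)): one dict per variable, one row each.
stats = pd.DataFrame([summarize(v) for v in cols]).set_index('variable')
print(stats)
```

Building the frame from a list of dicts avoids creating many one-row DataFrames, which may matter when iterating over a large number of columns.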