
Python equivalent for do.call(rbind, lapply()) from R

One of the main tools in my workflow is do.call(rbind, lapply()), as in this R example:

df1 <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
df2 <- data.frame(x1 = rnorm(10, 5), x2 = rnorm(10), x3 = rnorm(10))

getp <- function(var) {
  return(t.test(df1[, var], df2[, var])$p.value)
}

list <- c('x1', 'x2', 'x3')
ps <- do.call(rbind, lapply(list, getp))
ps
                 [,1]
[1,] 6.232025e-09
[2,] 2.128019e-09
[3,] 5.824713e-08

This creates a nice column of p-values. In practice I would return a one-row data.frame whose columns hold useful model statistics, with the goal of iterating over many columns with the same model type and comparing the fit and effects.

In Python, I can create a similar function:

from statsmodels.stats.weightstats import ttest_ind 
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x1' : np.random.randn(10), 'x2' : np.random.randn(10), 'x3' : np.random.randn(10)}) 
df2 = pd.DataFrame({'x1' : np.random.randn(10)+5, 'x2' : np.random.randn(10)+5, 'x3' : np.random.randn(10)+5}) 
def getp(var):
    print(ttest_ind(df1[var], df2[var])[1])

vars = ['x1', 'x2', 'x3']

I can print all the p-values to the console via:

for i in vars:
    getp(i)

9.67944232638e-08
1.82163637251e-08
2.00410346438e-10

But I'd like to save the result as an object with one column and three rows, as in R. Is this possible?

Thanks!

The actual function may look something like this:

def getMoreThanP(var):
    out = pd.DataFrame({'mean1' : [np.mean(df1[var])], 'mean2' : [np.mean(df2[var])], 'pvalue' : [ttest_ind(df1[var], df2[var])[1]]})
    print(out)

for i in vars:
    getMoreThanP(i)

     mean1     mean2        pvalue
0  0.24452  4.824327  2.438985e-11
      mean1     mean2        pvalue
0  0.187176  4.969862  1.115546e-11
      mean1     mean2        pvalue
0  0.035759  5.249378  1.525264e-08
Andrew Taylor asked Jun 08 '16 17:06


1 Answer

Instead of passing variables one by one, you can pass all three:

ttest_ind(df1[vars], df2[vars])[1]
Out[85]: array([  4.97835813e-11,   8.30544748e-08,   9.24917262e-07])

The returned object is a one-dimensional array. If you want a DataFrame instead:

pd.DataFrame(ttest_ind(df1[vars], df2[vars])[1])
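If it helps to see which p-value belongs to which variable, the array can be wrapped with the variable names as the index. This is only a minimal sketch under the question's setup: the seeded generator, the name `cols` (used instead of `vars`, which shadows a Python built-in), and the column label `pvalue` are illustrative choices, not part of the original code.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.weightstats import ttest_ind

# Assumed sample data, seeded so the sketch is repeatable.
rng = np.random.default_rng(0)
cols = ['x1', 'x2', 'x3']
df1 = pd.DataFrame(rng.normal(0, 1, size=(10, 3)), columns=cols)
df2 = pd.DataFrame(rng.normal(5, 1, size=(10, 3)), columns=cols)

# ttest_ind returns (t statistic, p-value, degrees of freedom);
# with 2-D input each column is tested separately.
tstat, pvalue, dof = ttest_ind(df1[cols], df2[cols])

# One column of p-values, one row per variable.
ps = pd.DataFrame({'pvalue': pvalue}, index=cols)
print(ps)
```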

This works because ttest_ind accepts array-like objects and, given two-dimensional input, tests each column separately. For getMoreThanP, you can use a combination of pd.concat and map:

def getMoreThanP(var):
    out = pd.DataFrame({'mean1' : [np.mean(df1[var])], 'mean2' : [np.mean(df2[var])], 'pvalue' : [ttest_ind(df1[var], df2[var])[1]]})
    return out

pd.concat(map(getMoreThanP, vars))
# pd.concat(map(getMoreThanP, vars), ignore_index=True) if you want to reset index
Out[134]: 
      mean1     mean2        pvalue
0 -0.021791  4.964985  4.978358e-11
0  0.087019  4.610332  8.305447e-08
0 -0.084168  4.680124  9.249173e-07

Note that I changed the definition of getMoreThanP to return the dataframe instead of printing it.
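An alternative to concatenating many one-row DataFrames is to have the function return a plain dict and build the DataFrame once from the list of dicts. The sketch below assumes the same kind of data as the question; `get_stats`, `cols`, and the seeded generator are hypothetical names and choices introduced for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.weightstats import ttest_ind

# Assumed sample data, seeded so the sketch is repeatable.
rng = np.random.default_rng(0)
cols = ['x1', 'x2', 'x3']
df1 = pd.DataFrame(rng.normal(0, 1, size=(10, 3)), columns=cols)
df2 = pd.DataFrame(rng.normal(5, 1, size=(10, 3)), columns=cols)

def get_stats(var):
    # Each returned dict becomes one row of the final DataFrame.
    tstat, pvalue, dof = ttest_ind(df1[var], df2[var])
    return {'mean1': df1[var].mean(),
            'mean2': df2[var].mean(),
            'pvalue': pvalue}

# The list comprehension plays the role of lapply(); the DataFrame
# constructor stacks the rows, much like do.call(rbind, ...).
result = pd.DataFrame([get_stats(v) for v in cols], index=cols)
print(result)
```

Using the variable names as the index keeps each row identifiable, which the repeated 0 index from pd.concat does not.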

ayhan answered Nov 11 '22 04:11