Is there an R function that applies a function to each pair of columns?

I often need to apply a function to each pair of columns in a data frame or matrix and return the results in a matrix. At the moment I always write a loop to do this. For instance, to make a matrix containing the p-values of correlations I write:

    df <- data.frame(x=rnorm(100),y=rnorm(100),z=rnorm(100))

    n <- ncol(df)

    foo <- matrix(0,n,n)

    for ( i in 1:n)
    {
        for (j in i:n)
        {
            foo[i,j] <- cor.test(df[,i],df[,j])$p.value
        }
    }

    foo[lower.tri(foo)] <- t(foo)[lower.tri(foo)]

    foo
              [,1]      [,2]      [,3]
    [1,] 0.0000000 0.7215071 0.5651266
    [2,] 0.7215071 0.0000000 0.9019746
    [3,] 0.5651266 0.9019746 0.0000000

which works, but is quite slow for very large matrices. I can write a function for this in R (not bothering with cutting time in half by assuming a symmetrical outcome as above):

    Papply <- function(x,fun)
    {
        n <- ncol(x)

        foo <- matrix(0,n,n)
        for ( i in 1:n)
        {
            for (j in 1:n)
            {
                foo[i,j] <- fun(x[,i],x[,j])
            }
        }
        return(foo)
    }

Or a function with Rcpp:

    library("Rcpp")
    library("inline")

    src <-
    '
    NumericMatrix x(xR);
    Function f(fun);
    NumericMatrix y(x.ncol(),x.ncol());

    for (int i = 0; i < x.ncol(); i++)
    {
        for (int j = 0; j < x.ncol(); j++)
        {
            y(i,j) = as<double>(f(wrap(x(_,i)),wrap(x(_,j))));
        }
    }
    return wrap(y);
    '

    Papply2 <- cxxfunction(signature(xR="numeric",fun="function"),src,plugin="Rcpp")

But both are quite slow even on a fairly small dataset of 100 variables (I thought the Rcpp function would be faster, but I guess converting between R and C++ on every call takes its toll):

    > system.time(Papply(matrix(rnorm(100*300),300,100),function(x,y)cor.test(x,y)$p.value))
       user  system elapsed
       3.73    0.00    3.73
    > system.time(Papply2(matrix(rnorm(100*300),300,100),function(x,y)cor.test(x,y)$p.value))
       user  system elapsed
       3.71    0.02    3.75

So my questions are:

  1. Due to the simplicity of these functions I assume this is already somewhere in R. Is there an apply or plyr function that does this? I have looked for it but haven't been able to find it.
  2. If so, is it faster?
Sacha Epskamp asked Mar 08 '11 13:03



2 Answers

It wouldn't be faster, but you can use outer to simplify the code. It does require a vectorized function, so here I've used Vectorize to make a vectorized version of the function to get the correlation between two columns.

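    df <- data.frame(x=rnorm(100),y=rnorm(100),z=rnorm(100))
    n <- ncol(df)

    # p-value of the correlation test between columns i and j of data
    corpij <- function(i,j,data) {cor.test(data[,i],data[,j])$p.value}
    # vectorize.args takes a character vector of argument names
    corp <- Vectorize(corpij, vectorize.args=c("i","j"))
    outer(1:n,1:n,corp,data=df)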
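A small sketch of why the `Vectorize` wrapper is needed: `outer` calls its function once, passing two equal-length vectors that hold every (i, j) combination, and expects a vector of the same length back. The names `pair_sum`, `scalar_only`, and `wrapped` below are made up for illustration and are not part of the answer.

```r
# outer() expands 1:3 x 1:3 into two vectors of length 9 and calls FUN once.
pair_sum <- function(i, j) i + j          # already vectorized, works directly
res <- outer(1:3, 1:3, pair_sum)          # 3 x 3 matrix with entries i + j

# A function that only makes sense for scalar i and j (like one that calls
# cor.test on two columns) has to be wrapped so it runs once per (i, j) pair.
m <- matrix(1:12, nrow = 4)               # illustrative data with 3 columns
scalar_only <- function(i, j, data) sum(data[, i] * data[, j])
wrapped <- Vectorize(scalar_only, vectorize.args = c("i", "j"))
cp <- outer(1:3, 1:3, wrapped, data = m)  # matches crossprod(m)
```

The wrapper still calls the underlying function once per pair, so this tidies the code rather than speeding it up.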
Aaron left Stack Overflow answered Sep 24 '22 06:09


92% of the time is being spent in cor.test.default and the routines it calls, so it's hopeless trying to get faster results by simply rewriting Papply (other than the savings from computing only the entries above or below the diagonal, assuming that your function is symmetric in x and y).

    > M <- matrix(rnorm(100*300),300,100)
    > Rprof(); junk <- Papply(M,function(x,y) cor.test( x, y)$p.value); Rprof(NULL)
    > summaryRprof()
    $by.self
                     self.time self.pct total.time total.pct
    cor.test.default      4.36    29.54      13.56     91.87
    # ... snip ...
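The one saving mentioned above, evaluating only one triangle when the function is symmetric in x and y, might look like the following sketch. `PapplySym` is a hypothetical name, not an existing function, and filling the diagonal by calling `fun` on each column with itself is an assumption about what the caller wants there.

```r
# Sketch: apply fun to each unordered pair of columns once, then mirror.
# Assumes fun(x, y) == fun(y, x) and that fun returns a single number.
PapplySym <- function(x, fun) {
  n <- ncol(x)
  out <- matrix(NA_real_, n, n)
  if (n > 1) {
    idx <- combn(n, 2)  # 2 x choose(n, 2) matrix of column pairs with i < j
    vals <- vapply(seq_len(ncol(idx)),
                   function(k) fun(x[, idx[1, k]], x[, idx[2, k]]),
                   numeric(1))
    ij <- t(idx)                              # one (i, j) pair per row
    out[ij] <- vals                           # fill the upper triangle
    out[ij[, 2:1, drop = FALSE]] <- vals      # mirror into the lower triangle
  }
  # Diagonal: fun applied to each column with itself (an assumption).
  diag(out) <- vapply(seq_len(n), function(i) fun(x[, i], x[, i]), numeric(1))
  out
}

set.seed(1)
M <- matrix(rnorm(50 * 5), 50, 5)
p <- PapplySym(M, function(x, y) cor.test(x, y)$p.value)
```

This makes n(n-1)/2 + n calls to `fun` instead of n^2, so it can at best roughly halve the run time; the cost of each `cor.test` call is unchanged.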
G. Grothendieck answered Sep 25 '22 06:09