I often need to apply a function to each pair of columns in a dataframe/matrix and return the results in a matrix. At the moment I always write a loop to do this. For instance, to make a matrix containing the p-values of correlations I write:
    df <- data.frame(x=rnorm(100), y=rnorm(100), z=rnorm(100))

    n <- ncol(df)

    foo <- matrix(0, n, n)

    for (i in 1:n)
    {
        for (j in i:n)
        {
            foo[i,j] <- cor.test(df[,i], df[,j])$p.value
        }
    }

    foo[lower.tri(foo)] <- t(foo)[lower.tri(foo)]

    foo
              [,1]      [,2]      [,3]
    [1,] 0.0000000 0.7215071 0.5651266
    [2,] 0.7215071 0.0000000 0.9019746
    [3,] 0.5651266 0.9019746 0.0000000
which works, but is quite slow for very large matrices. I can write a function for this in R (not bothering with cutting time in half by assuming a symmetrical outcome as above):
    Papply <- function(x, fun)
    {
        n <- ncol(x)

        foo <- matrix(0, n, n)
        for (i in 1:n)
        {
            for (j in 1:n)
            {
                foo[i,j] <- fun(x[,i], x[,j])
            }
        }
        return(foo)
    }
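The symmetric shortcut mentioned above can be folded into the function itself. A minimal sketch, assuming fun gives the same result when its two arguments are swapped; PapplySym is a hypothetical name, not part of the original post:

```r
# Hypothetical helper (not from the original post): like Papply, but assumes
# fun(a, b) == fun(b, a) and evaluates each unordered pair of columns once.
PapplySym <- function(x, fun) {
  n <- ncol(x)
  res <- matrix(NA_real_, n, n, dimnames = list(colnames(x), colnames(x)))
  for (i in seq_len(n)) {
    for (j in i:n) {
      res[i, j] <- fun(x[, i], x[, j])
      res[j, i] <- res[i, j]  # mirror the value into the lower triangle
    }
  }
  res
}

# Usage on a small random matrix
set.seed(1)
m <- matrix(rnorm(50 * 4), 50, 4)
p <- PapplySym(m, function(a, b) cor.test(a, b)$p.value)
isSymmetric(p)
```

This makes n(n+1)/2 calls to fun instead of n^2, so at best it roughly halves the run time.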
Or a function with Rcpp:
    library("Rcpp")
    library("inline")

    src <- '
    NumericMatrix x(xR);
    Function f(fun);
    NumericMatrix y(x.ncol(),x.ncol());

    for (int i = 0; i < x.ncol(); i++)
    {
        for (int j = 0; j < x.ncol(); j++)
        {
            y(i,j) = as<double>(f(wrap(x(_,i)),wrap(x(_,j))));
        }
    }
    return wrap(y);
    '

    Papply2 <- cxxfunction(signature(xR="numeric",fun="function"),src,plugin="Rcpp")
But both are quite slow even on a pretty small dataset of 100 variables (I thought the Rcpp function would be faster, but I guess converting between R and C++ all the time takes its toll):
    > system.time(Papply(matrix(rnorm(100*300),300,100),function(x,y)cor.test(x,y)$p.value))
       user  system elapsed
       3.73    0.00    3.73
    > system.time(Papply2(matrix(rnorm(100*300),300,100),function(x,y)cor.test(x,y)$p.value))
       user  system elapsed
       3.71    0.02    3.75
So my question is: is there a plyr function that does this? I have looked for one but haven't been able to find it.
It wouldn't be faster, but you can use outer to simplify the code. It does require a vectorized function, so here I've used Vectorize to make a vectorized version of the function that gets the p-value of the correlation between two columns.
    df <- data.frame(x=rnorm(100), y=rnorm(100), z=rnorm(100))
    n <- ncol(df)

    corpij <- function(i, j, data) {cor.test(data[,i], data[,j])$p.value}
    corp <- Vectorize(corpij, vectorize.args=list("i","j"))
    outer(1:n, 1:n, corp, data=df)
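The same pattern can be wrapped so that it accepts any function of two columns. A rough sketch, assuming fun returns a single number for each pair; pairwise_outer and its arguments are hypothetical names chosen for illustration:

```r
# Hypothetical wrapper (names are illustrative): apply `fun` to every pair
# of columns of `x` using outer() on the column indices.
pairwise_outer <- function(x, fun) {
  idx <- seq_len(ncol(x))
  # outer() hands over whole index vectors, so the scalar pair function
  # is wrapped with Vectorize() to be called once per (i, j) pair.
  pair_fun <- Vectorize(function(i, j) fun(x[, i], x[, j]))
  res <- outer(idx, idx, pair_fun)
  dimnames(res) <- list(colnames(x), colnames(x))
  res
}

# Usage with the same kind of data as above
set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
pmat <- pairwise_outer(df, function(a, b) cor.test(a, b)$p.value)
pmat
```

Vectorize() is built on mapply(), so fun is still called once per pair and the run time should be about the same as the explicit loop.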
92% of the time is being spent in cor.test.default and the routines it calls, so it's hopeless trying to get faster results by simply rewriting Papply (other than the savings from computing only the entries above or below the diagonal, assuming that your function is symmetric in x and y).
    > M <- matrix(rnorm(100*300),300,100)
    > Rprof(); junk <- Papply(M,function(x,y) cor.test( x, y)$p.value); Rprof(NULL)
    > summaryRprof()
    $by.self
                     self.time self.pct total.time total.pct
    cor.test.default      4.36    29.54      13.56     91.87
    # ... snip ...
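Since the profile points at cor.test itself, one way around it, for Pearson correlations specifically, is to skip cor.test and derive the p-values from the correlation matrix. This is a different technique from the loops above, not a rewrite of Papply: it uses the statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, which is the test cor.test applies for its default Pearson method. A sketch, assuming complete numeric data with no missing values; cor_pvalues is a hypothetical name:

```r
# Sketch, not from the original post: two-sided Pearson p-values for all
# pairs of columns at once. Assumes complete numeric data (no NAs).
cor_pvalues <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)
  r <- cor(x)                              # all pairwise correlations in one call
  tstat <- r * sqrt((n - 2) / (1 - r^2))   # t statistic with n - 2 df
  p <- 2 * pt(-abs(tstat), df = n - 2)     # two-sided p-value
  diag(p) <- 0                             # a column correlates perfectly with itself
  p
}

# Usage: compare one entry against cor.test on the same data
set.seed(1)
M <- matrix(rnorm(100 * 300), 300, 100)
pvals <- cor_pvalues(M)
all.equal(pvals[1, 2], cor.test(M[, 1], M[, 2])$p.value)
```

This only covers the correlation example. For an arbitrary fun there is no such shortcut, and the cost of the individual calls still dominates.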