I currently have a solution that works. I'm wondering if there is a more elegant approach?
First the setup:
set.seed(315)
mat <- matrix(sample(1:5, 20, replace = TRUE), nrow = 4, ncol = 5)
> mat
     [,1] [,2] [,3] [,4] [,5]
[1,]    3    4    1    3    3
[2,]    5    3    5    1    4
[3,]    4    1    1    4    3
[4,]    3    3    1    1    1
From this matrix, I want to create a 5x5 output matrix, where the (i, j) entry is the number of positions at which the elements of column j of the input matrix are greater than the corresponding elements of column i.
edit: Originally I described a solution where entry (i, j) of the output is the number of elements in column i that are greater than column j, but I provided the opposite relationship in the output. I have changed the description to match the output; any differences in the answers provided are probably a result of that.
This solution gives the desired output:
mat.pm <- apply(mat, MARGIN = 2, function(x) {
  apply(mat, MARGIN = 2, function(y) sum(x > y))
})
> mat.pm
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    1    0    0    0
[2,]    2    0    1    1    2
[3,]    3    2    0    2    2
[4,]    2    3    1    0    1
[5,]    3    2    1    1    0
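As a sanity check on the index convention (this snippet is my own addition; the matrix is hard-coded from the output above rather than re-seeded, since sample() results can differ across R versions):

```r
# The matrix shown above, hard-coded so the check does not depend on the RNG
mat <- matrix(c(3, 5, 4, 3,
                4, 3, 1, 3,
                1, 5, 1, 1,
                3, 1, 4, 1,
                3, 4, 3, 1), nrow = 4)

mat.pm <- apply(mat, MARGIN = 2, function(x) {
  apply(mat, MARGIN = 2, function(y) sum(x > y))
})

# Entry (i, j) counts the rows where column j beats column i
for (i in seq_len(ncol(mat))) {
  for (j in seq_len(ncol(mat))) {
    stopifnot(mat.pm[i, j] == sum(mat[, j] > mat[, i]))
  }
}
```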
Is there possibly a better way that doesn't involve double-nested apply functions?
edit: here is some benchmarking for the various approaches:
library(microbenchmark)
set.seed(315)
bm_data <- matrix(sample(1:5, 6000, replace = TRUE), nrow = 200, ncol = 30)
op <- microbenchmark(
  APPLY1 = apply(bm_data, MARGIN = 2, function(x) {
    apply(bm_data, MARGIN = 2, function(y) sum(x > y))
  }),
  APPLY2 = apply(bm_data, 2, function(x) colSums(x > bm_data)),
  SWEEP = apply(bm_data, 2, function(x) colSums(sweep(bm_data, 1, x, "-") < 0)),
  VECTORIZE = {
    n <- 1:ncol(bm_data)
    ind <- expand.grid(n, n)
    out <- colSums(bm_data[, c(ind[, 2])] > bm_data[, c(ind[, 1])])
  },
  SAPPLY = sapply(seq(ncol(bm_data)), function(i) colSums(bm_data[, i] > bm_data)),
  times = 100L
)
> summary(op)
       expr      min        lq    median        uq       max neval
1    APPLY1 9742.091 10519.757 10923.896 11876.614 13006.850   100
2    APPLY2 1012.097  1080.926  1148.111  1247.965  3338.314   100
3     SWEEP 7020.979  7667.972  8580.420  8943.674 33601.336   100
4 VECTORIZE 3036.700  3266.815  3516.449  4476.769 28638.246   100
5    SAPPLY  978.812  1021.754  1078.461  1150.782  3303.798   100
@Ricardo's SAPPLY and @Simon's APPLY2 strategies are both nice one-line solutions that perform much faster than my APPLY1 approach. In terms of elegance, @Simon's update with APPLY2 hits the mark - simple, readable and fast.
One takeaway I learned through the discussion here is how much faster apply functions rip through a matrix compared to a data.frame. Convert, then compute, if possible.
@Simon's expand.grid is the most creative - I had not even thought to approach the problem in that way. Nice.
See below for details, but you can do this in a single apply loop:
apply( mat , 2 , function(x) colSums( x > mat ) )
apply is fast here because it is optimised to work on matrices. A lot of the time spent using apply is usually in the conversion of a data.frame to a matrix, which is not needed here.
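Why the inner colSums( x > mat ) works (a sketch of the mechanics, with my own illustrative variables): a matrix is stored column-major, so comparing a length-nrow(mat) vector x against mat recycles x down each column, and colSums then counts the TRUEs per column.

```r
# Matrix from the question, hard-coded for reproducibility
mat <- matrix(c(3, 5, 4, 3,
                4, 3, 1, 3,
                1, 5, 1, 1,
                3, 1, 4, 1,
                3, 4, 3, 1), nrow = 4)

x <- mat[, 1]     # one value per row
cmp <- x > mat    # x is recycled down each column of mat
stopifnot(identical(dim(cmp), dim(mat)))

# colSums(cmp)[j] is therefore sum(mat[, 1] > mat[, j])
stopifnot(all(colSums(cmp) ==
              sapply(seq_len(ncol(mat)), function(j) sum(x > mat[, j]))))
```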
It's possible to do this entirely vectorised because > has a method for matrices and data.frames. You can therefore get the indices of the columns to compare using expand.grid(), use these to subset the matrix, do the logical comparison, then use colSums to get the result and matrix to wrap it back up to the correct size. All this in 4 lines (really it could be 2):
n <- 1:ncol(mat)
ind <- expand.grid(n,n)
out <- colSums( mat[,c(ind[,1])] > mat[,c(ind[,2])] )
matrix( out , ncol(mat) , byrow = TRUE )
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    0    1    0    0    0
#[2,]    2    0    1    1    2
#[3,]    3    2    0    2    2
#[4,]    2    3    1    0    1
#[5,]    3    2    1    1    0
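The "really it could be 2" remark can be made concrete; this is just the four lines above collapsed, with no new behaviour (the matrix is hard-coded here so the snippet stands alone):

```r
# Matrix from the question, hard-coded for reproducibility
mat <- matrix(c(3, 5, 4, 3,
                4, 3, 1, 3,
                1, 5, 1, 1,
                3, 1, 4, 1,
                3, 4, 3, 1), nrow = 4)

ind <- expand.grid(seq_len(ncol(mat)), seq_len(ncol(mat)))
out <- matrix(colSums(mat[, ind[, 1]] > mat[, ind[, 2]]),
              ncol(mat), byrow = TRUE)
```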
apply seems even faster, and combining apply with @Ricardo's comparison against the whole matrix leads to a single, fastest (?) apply solution that is approximately 4 times quicker than the OP's:
# Single apply loop
f1 <- function(mat) apply( mat , 2 , function(x) colSums( x > mat ) )
# OP double apply loop
f2 <- function(mat) {
  apply(mat, MARGIN = 2, function(x) {
    apply(mat, MARGIN = 2, function(y) sum(x > y))
  })
}
require(microbenchmark)
microbenchmark( f1(mat) , f2(mat) )
#Unit: microseconds
# expr min lq median uq max neval
# f1(mat) 95.190 97.6405 102.7145 111.4635 159.584 100
# f2(mat) 361.862 370.7860 398.7830 418.3810 1336.506 100
I think the results you have are transposed:
## This gives you what you show as output
sapply(seq(ncol(mat)), function(i) colSums(mat[, i] > mat))
## This gives you what you _describe_ in the question
t(sapply(seq(ncol(mat)), function(i) colSums(mat[, i] > mat)))
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    2    3    2    3
[2,]    1    0    2    3    2
[3,]    0    1    0    1    1
[4,]    0    1    2    0    1
[5,]    0    2    2    1    0
Here's a way using only one apply, but replacing the other with sweep, so not sure it counts:
apply(mat,2,function(x) colSums(sweep(mat,1,x,"-")<0))
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    1    0    0    0
[2,]    2    0    1    1    2
[3,]    3    2    0    2    2
[4,]    2    3    1    0    1
[5,]    3    2    1    1    0
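For what it's worth, sweep(mat, 1, x, "-") < 0 is element-wise the same test as x > mat (mat[i, j] - x[i] < 0 is just x[i] > mat[i, j]), so this answer and the single-apply colSums answer compute identical matrices. A small check, using the question's matrix hard-coded:

```r
# Matrix from the question, hard-coded for reproducibility
mat <- matrix(c(3, 5, 4, 3,
                4, 3, 1, 3,
                1, 5, 1, 1,
                3, 1, 4, 1,
                3, 4, 3, 1), nrow = 4)

via_sweep   <- apply(mat, 2, function(x) colSums(sweep(mat, 1, x, "-") < 0))
via_recycle <- apply(mat, 2, function(x) colSums(x > mat))
stopifnot(identical(via_sweep, via_recycle))
```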
Benchmarking:
bigmat <- matrix(sample(0:5, 200, replace = TRUE), nrow = 10)
gridfoo <- function(mat) {
  n <- 1:ncol(mat)
  ind <- expand.grid(n, n)
  out <- colSums(mat[, c(ind[, 1])] > mat[, c(ind[, 2])])
}
appfoo <- function(mat) apply(mat, 2, function(x) colSums(sweep(mat, 1, x, "-") < 0))
app2foo <- function(mat) t(sapply(seq(ncol(mat)), function(i) colSums(mat[, i] > mat)))
microbenchmark(gridfoo(bigmat), appfoo(bigmat), app2foo(bigmat), times = 10)
Unit: microseconds
            expr      min       lq    median       uq      max neval
 gridfoo(bigmat)  363.909  369.895  381.4410  413.086  522.557    10
  appfoo(bigmat) 1208.892 1231.129 1238.1850 1252.083 1521.913    10
 app2foo(bigmat)  298.482  310.883  317.0835  323.284  762.454    10
But... (note the difference in time units):
Rgames> bigmat<-matrix(sample(0:5,20000,rep=T),nr=100)
Rgames> microbenchmark(gridfoo(bigmat),appfoo(bigmat),app2foo(bigmat),times=10)
Unit: milliseconds
            expr       min        lq   median        uq       max neval
 gridfoo(bigmat) 106.15115 112.98458 149.5746 183.87987 249.35418    10
  appfoo(bigmat) 127.44553 127.92874 132.5372 136.42562 199.12123    10
 app2foo(bigmat)  14.64483  14.99676  18.6089  20.51824  20.91122    10