Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Elegant way to count number of elements in each column of a matrix that are greater than those in every other column?

I currently have a solution that works. I'm wondering if there is a more elegant approach?

First the setup:

set.seed(315)
mat <- matrix(sample(1:5, 20, replace = TRUE), nrow = 4, ncol = 5)
> mat
     [,1] [,2] [,3] [,4] [,5]
[1,]    3    4    1    3    3
[2,]    5    3    5    1    4
[3,]    4    1    1    4    3
[4,]    3    3    1    1    1

From this matrix, I want to create an 5x5 output matrix, where the entry in i,j is the number of elements in column j that were greater than column i of the input matrix.

edit: Originally I described a solution where the entry i,j of the output solution is the number of elements in column i are greater than column j, but provided the opposite relationship in the output. I changed the description to match the output, and any differences in the answers provided are probably a result of that.

This solution gives the desired output:

mat.pm <- apply(mat, MARGIN = 2,
                function(x) {
                  return(apply(mat, MARGIN = 2, function(y) {
                    return(sum(x > y))
                  }))
                })

> mat.pm
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    1    0    0    0
[2,]    2    0    1    1    2
[3,]    3    2    0    2    2
[4,]    2    3    1    0    1
[5,]    3    2    1    1    0

Is there possibly a better way that doesn't involve double-nested apply functions?

edit: here is some benchmarking for the various approaches:

library(microbenchmark)

set.seed(315)
bm_data <- matrix(sample(1:5, 6000, replace = TRUE), nrow = 200, ncol = 30)

op <- microbenchmark(
  APPLY1 = apply(bm_data, MARGIN = 2,
                function(x) {
                  return(apply(bm_data, MARGIN = 2, function(y) {
                    return(sum(x > y))
                  }))
                }),
  APPLY2 = apply(bm_data, 2 , function(x) colSums( x > bm_data)),
  SWEEP = apply(bm_data,2,function(x) colSums(sweep(bm_data,1,x,"-")<0)),
  VECTORIZE = {
    n <- 1:ncol(bm_data);
    ind <- expand.grid(n, n)
    out <- colSums(bm_data[,c(ind[,2])] > bm_data[,c(ind[,1])])
  },
  SAPPLY = sapply(seq(ncol(bm_data)), function(i) colSums(bm_data[, i] > bm_data)),
  times = 100L
)

> summary(op)
       expr      min        lq    median        uq       max neval
1    APPLY1 9742.091 10519.757 10923.896 11876.614 13006.850   100
2    APPLY2 1012.097  1080.926  1148.111  1247.965  3338.314   100
3     SWEEP 7020.979  7667.972  8580.420  8943.674 33601.336   100
4 VECTORIZE 3036.700  3266.815  3516.449  4476.769 28638.246   100
5    SAPPLY  978.812  1021.754  1078.461  1150.782  3303.798   100

@Ricardo's SAPPLY and @Simon's APPLY2 strategies are both nice, one line solutions that perform much faster that my APPLY1 approach. In terms of elegance, @Simon's update with APPLY2 hits the mark - simple, readable and fast.

One takeaway I learned through discussion here is how much faster apply functions rip through a matrix compared to a data.frame. Convert, then compute if possible.

@Simon's expand.grid is the most creative - I had not even thought to approach the problem in that way. Nice.

like image 954
Christian Lemp Avatar asked Jan 14 '14 17:01

Christian Lemp


4 Answers

Edit:

See below for details but you can do this in a single apply loop:

apply( mat , 2 , function(x) colSums( x > mat )
  

apply is fast here, because is optimised to work on matrices. A lot of the time spent using apply is usually in the conversion of a data.frame to a matrix which is not needed here.


Original:

It's possible to do this entirely vectorised because > has an method for matrices and data.frames. Therefore you can get the indices of columns to compare using expand.grid(), use this to subset the matrix, do the logical comparison and then use colSums to get the result and matrix to wrap it back up to the correct size. All this in 4 lines (really it could be 2):

n <- 1:ncol(mat)
ind <- expand.grid(n,n)
out <- colSums( mat[,c(ind[,1])] > mat[,c(ind[,2])] )
   
matrix( out , ncol(mat) , byrow = TRUE )
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    0    1    0    0    0
#[2,]    2    0    1    1    2
#[3,]    3    2    0    2    2
#[4,]    2    3    1    0    1
#[5,]    3    2    1    1    0

Update:

apply seems even faster, and combining apply with @Ricardo's comparison of the whole matrix leads us to a single, fastest (?) apply solution which is approximately 4 times quicker than the OP:

#  Single apply loop
f1 <- function(mat) apply( mat , 2 , function(x) colSums( x > mat ) )

#  OP double apply loop
f2 <- function(mat) {apply(mat, MARGIN = 2,
                function(x) {
                  return(apply(mat, MARGIN = 2, function(y) {
                    return(sum(x > y))
                  }))})}

require(microbenchmark)
microbenchmark( f1(mat) , f2(mat) )

#Unit: microseconds
#    expr     min       lq   median       uq      max neval
# f1(mat)  95.190  97.6405 102.7145 111.4635  159.584   100
# f2(mat) 361.862 370.7860 398.7830 418.3810 1336.506   100
like image 79
Simon O'Hanlon Avatar answered Sep 24 '22 14:09

Simon O'Hanlon


I think the results you have are transposed:

## This gives you what you show as output
sapply(seq(ncol(mat)), function(i) colSums(mat[, i] > mat))

## This gives you what you _describe_ in the question
t(sapply(seq(ncol(mat)), function(i) colSums(mat[, i] > mat)))

     [,1] [,2] [,3] [,4] [,5]
[1,]    0    2    3    2    3
[2,]    1    0    2    3    2
[3,]    0    1    0    1    1
[4,]    0    1    2    0    1
[5,]    0    2    2    1    0
like image 26
Ricardo Saporta Avatar answered Sep 23 '22 14:09

Ricardo Saporta


Here's a way using only 1 apply, but replacing the other with sweep, so not sure it counts:

apply(mat,2,function(x) colSums(sweep(mat,1,x,"-")<0))
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    1    0    0    0
[2,]    2    0    1    1    2
[3,]    3    2    0    2    2
[4,]    2    3    1    0    1
[5,]    3    2    1    1    0
like image 35
James Avatar answered Sep 26 '22 14:09

James


Benchmarking:

 bigmat<-matrix(sample(0:5,200,rep=T),nr=10)
    gridfoo <- function(mat) {
    n <- 1:ncol(mat)
    ind <- expand.grid(n,n)
    out <- colSums( mat[,c(ind[,1])] > mat[,c(ind[,2])] )
    }

    appfoo<- function(mat) apply(mat,2,function(x) colSums(sweep(mat,1,x,"-")<0))


    app2foo<- function(mat) t(sapply(seq(ncol(mat)), function(i) colSums(mat[, i] > mat)))

 microbenchmark(gridfoo(bigmat),appfoo(bigmat),app2foo(bigmat),times=10)
Unit: microseconds
            expr      min       lq    median       uq      max neval
 gridfoo(bigmat)  363.909  369.895  381.4410  413.086  522.557    10
  appfoo(bigmat) 1208.892 1231.129 1238.1850 1252.083 1521.913    10
 app2foo(bigmat)  298.482  310.883  317.0835  323.284  762.454    10

But...(notice difference in time unit)

Rgames> bigmat<-matrix(sample(0:5,20000,rep=T),nr=100)
Rgames> microbenchmark(gridfoo(bigmat),appfoo(bigmat),app2foo(bigmat),times=10)
Unit: milliseconds
            expr       min        lq   median        uq       max neval
 gridfoo(bigmat) 106.15115 112.98458 149.5746 183.87987 249.35418    10
  appfoo(bigmat) 127.44553 127.92874 132.5372 136.42562 199.12123    10
 app2foo(bigmat)  14.64483  14.99676  18.6089  20.51824  20.91122    10
like image 42
Carl Witthoft Avatar answered Sep 22 '22 14:09

Carl Witthoft