
Fastest way to count the occurrences of each unique column in a matrix in R

I'm new to R (and to Stack Overflow) and would appreciate your help. I would like to count the number of occurrences of each unique column in a matrix. I have written the following code, but it is extremely slow:

frequencyofequalcolumnsinmatrix = function(matrixM){
  # Returns a matrix columnswithfrequencyofmtxM that contains each distinct
  # column of matrixM, with the frequency of that column in the last row.
  # Hence if the last row is c(3,5,3,2), then matrixM has 3+5+3+2=13 columns;
  # there are 4 distinct columns; and the first distinct column appears
  # 3 times, the second distinct column appears 5 times, etc.

  columnswithfrequencyofmtxM = c()

  while (ncol(matrixM) > 0){
    # indices of all columns equal to the first remaining column
    # (all(x == 0) also works for integer matrices, unlike identical() against rep(0, n))
    indexzero = which(apply(matrixM - matrixM[,1], 2, function(x) all(x == 0)))
    indexnotzero = setdiff(seq_len(ncol(matrixM)), indexzero)

    # vector of length nrow(matrixM)+1: entries 1 to nrow(matrixM) hold the
    # distinct column, and the last entry holds its frequency
    frequencyofgivencolumn = c(matrixM[,1], length(indexzero))
    columnswithfrequencyofmtxM = cbind(columnswithfrequencyofmtxM, frequencyofgivencolumn, deparse.level=0)

    # drop = FALSE keeps matrixM a matrix even when a single column remains
    matrixM = matrixM[, indexnotzero, drop = FALSE]
  }

  return(columnswithfrequencyofmtxM)
}

If we apply it to the matrix 'testmtx', we obtain:

> testmtx = matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow=3, ncol=6)
> frequencyofequalcolumnsinmatrix(testmtx)
     [,1] [,2] [,3]
[1,]    1    0    1
[2,]    2    1    2
[3,]    4    1    1
[4,]    2    3    1

where the last row contains the number of occurrences of the column above.

Unhappy with my code, I searched Stack Overflow and found the following question:

Fastest way to count occurrences of each unique element

That question shows that the fastest way to count the occurrences of each unique element of a vector is to use the data.table package. Here is the code:

library(data.table)

f6 <- function(x){
  data.table(x)[, .N, keyby = x]
}

When we run it we obtain:

> vtr = c(1,2,3,1,1,2,4,2,4)
> f6(vtr)
   x N
1: 1 3
2: 2 3
3: 3 1
4: 4 2

I have tried to modify this code for my case. That would require creating vtr as a vector in which each element is itself a vector, but I haven't been able to do that (most likely because in R, c(c(1,2),c(3,4)) is the same as c(1,2,3,4)).
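
A quick base-R check of that flattening behaviour, together with list(), which does keep each vector as a separate element. The matrix m below is a made-up example and not part of the question:

```r
# c() concatenates its arguments into one flat vector
flat <- c(c(1, 2), c(3, 4))
stopifnot(identical(flat, c(1, 2, 3, 4)))

# a list keeps each vector as its own element
nested <- list(c(1, 2), c(3, 4))
stopifnot(length(nested) == 2)

# the columns of a matrix can be collected into such a list
m <- matrix(c(1, 2, 3, 4, 1, 2), nrow = 2)   # columns: (1,2), (3,4), (1,2)
cols <- lapply(seq_len(ncol(m)), function(j) m[, j])
stopifnot(identical(cols[[1]], cols[[3]]))   # first and third columns are equal
```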

Should I try to modify the function f6? If so, how?
Or should I take a completely different approach? If so, which one?

Thank you!

Gaël Giordano asked Mar 17 '23 01:03



1 Answer

One simple way would be to paste each column together into a single string, giving a character vector with one element per column, and then apply f6 to that vector.

mat <- matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow=3, ncol=6)

vec <- apply(mat, 2, paste, collapse=" ")

f6(vec)
       x N
1: 0 1 1 3
2: 1 2 1 1
3: 1 2 4 2

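If the result is needed in the layout from the question (distinct columns on top, counts in the last row), the pasted keys can be mapped back to the matrix. A minimal base-R sketch, assuming a helper named count_columns_by_key (a hypothetical name, not from the question or this answer) and values whose string form stays unambiguous when joined with a space:

```r
# Hypothetical helper: counts identical columns via one string key per column.
count_columns_by_key <- function(m) {
  # one key per column, e.g. "1 2 4"
  keys <- apply(m, 2, paste, collapse = " ")
  # TRUE at the first occurrence of each distinct column
  first <- !duplicated(keys)
  # index of each column's key among the distinct keys, tabulated into counts
  counts <- tabulate(match(keys, keys[first]), nbins = sum(first))
  # distinct columns on top, their frequencies in the last row
  rbind(m[, first, drop = FALSE], counts, deparse.level = 0)
}

testmtx <- matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow = 3, ncol = 6)
res <- count_columns_by_key(testmtx)
```

On testmtx, res has the same shape as the matrix printed in the question: three distinct columns with the counts 2, 3 and 1 in the last row.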
EDIT

The answer by @RohitDas made me think: when it comes to performance, it is always best to check. I took all the functions shown in the question the OP linked here and added

f7 <- table

I also added f10, a suggestion by @DavidArenburg:

f10 <- function(x){ 
  table(unlist(data.table(x)[, lapply(.SD, paste, collapse = "")])) 
}
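
One caveat about pasting with collapse = "" as in f10: multi-digit entries can make different columns collapse to the same string, whereas a separator avoids that for numeric data. A small illustration with made-up values (a and b are not from the benchmark):

```r
a <- c(1, 12)
b <- c(11, 2)

# without a separator both columns become "112" and would be counted together
key_a_nosep <- paste(a, collapse = "")
key_b_nosep <- paste(b, collapse = "")
stopifnot(key_a_nosep == key_b_nosep)

# with a separator the keys stay distinct: "1 12" versus "11 2"
key_a_sep <- paste(a, collapse = " ")
key_b_sep <- paste(b, collapse = " ")
stopifnot(key_a_sep != key_b_sep)
```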

Here are the results. The solution by @MaratTalipov, added afterwards, is the clear winner (judging by the timings below, this is f11, the only call faster than every vector-based one). Applied directly to the matrix, it is faster than all the vector solutions.

library(microbenchmark)

set.seed(1)
testmx <- matrix(sample(1:10, 3 * 1e3, replace=TRUE), nrow=1000)

microbenchmark(
   f1(apply(testmx, 2, paste, collapse=" ")),
   f2(apply(testmx, 2, paste, collapse=" ")),
   f3(apply(testmx, 2, paste, collapse=" ")),
   f4(apply(testmx, 2, paste, collapse=" ")),
   f5(apply(testmx, 2, paste, collapse=" ")),
   f6(apply(testmx, 2, paste, collapse=" ")),
   f7(apply(testmx, 2, paste, collapse=" ")),
   f8(apply(testmx, 2, paste, collapse=" ")),
   f9(apply(testmx, 2, paste, collapse=" ")),
   f10(testmx),
   f11(testmx),
   f12(testmx)
   )
Unit: microseconds
                                       expr      min        lq      mean   median        uq       max neval
 f1(apply(testmx, 2, paste, collapse = " ")) 3311.770 3511.5620 3901.0020 3612.035 3849.3600  9569.987   100
 f2(apply(testmx, 2, paste, collapse = " ")) 3044.997 3263.6515 3667.9232 3430.914 3847.2430  6721.318   100
 f3(apply(testmx, 2, paste, collapse = " ")) 2032.179 2118.0245 2371.8638 2213.301 2430.4155  6631.624   100
 f4(apply(testmx, 2, paste, collapse = " ")) 2119.949 2218.3050 2497.1513 2286.442 2425.0260  6258.987   100
 f5(apply(testmx, 2, paste, collapse = " ")) 2131.498 2221.5775 2459.9300 2309.925 2530.3115  4222.575   100
 f6(apply(testmx, 2, paste, collapse = " ")) 3121.217 3367.7815 3738.3239 3486.155 3835.1175  7979.352   100
 f7(apply(testmx, 2, paste, collapse = " ")) 1766.175 1832.9650 2040.5483 1889.169 2032.1795  3784.110   100
 f8(apply(testmx, 2, paste, collapse = " ")) 2085.303 2169.2240 2435.6932 2237.168 2404.2380  5002.109   100
 f9(apply(testmx, 2, paste, collapse = " ")) 2802.090 2988.0230 3449.0685 3056.930 3373.1710 17640.957   100
                                f10(testmx) 4027.017 4251.6385 4865.7036 4399.461 4848.7035 11811.581   100
                                f11(testmx)  500.058  549.1395  624.9526  576.279  636.1395  1176.809   100
                                f12(testmx) 1827.769 1886.4740 1957.0555 1902.834 1964.4270  3600.487   100
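
For reference, one way to count directly on the matrix without building string keys is to transpose it and group by every column with data.table. This is only a sketch of that idea and is not necessarily what f11 or f12 do, since their definitions are not shown here; count_columns_dt is a hypothetical name:

```r
library(data.table)

# Hypothetical helper: one row per original column, grouped by all of its values.
count_columns_dt <- function(m) {
  dt <- as.data.table(t(m))    # columns V1..Vn hold rows 1..n of m
  dt[, .N, by = names(dt)]     # N = number of identical columns of m
}

testmtx <- matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow = 3, ncol = 6)
res <- count_columns_dt(testmtx)
```

Each row of res is one distinct column of testmtx followed by its count N. Calling t(as.matrix(res)) gives the layout used in the question, with row names added.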
cdeterman answered Mar 19 '23 16:03
