Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R combinations, looking for faster and more efficient way(package,code,parallel cpu) than basic R

I'm using basic R for combinations.

For example, let's say I have a matrix with 2 rows and 5 columns:

 z<-matrix(c(1, 2, 1, 3, 2, 2, 1, 3, 2, 1),nrow=2,ncol=5,byrow = TRUE)

[,1] [,2] [,3] [,4] [,5]

[1,]    1    2    1    3    2

[2,]    2    1    3    2    1

I'm using the code below for combinations of sets of 3 from the 5 columns:

l<- apply(X = combn(seq_len(ncol(z)), 3),MAR = 2,FUN = function(jj) {apply(z[, jj], 1, paste, collapse="") })

This exports what I need:

[,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10]

[1,] "121" "123" "122" "113" "112" "132" "213" "212" "232" "132"

[2,] "213" "212" "211" "232" "231" "221" "132" "131" "121" "321"

The problem starts when I'm using big data in a matrix, for example when I have a matrix with 15000 rows and 17 columns and I need the combinations of sets of 10 from the 17 columns.

In this example this export takes very long time.

Is there any faster and more efficient way than basic R (maybe some packages or code,or using parallel cpu's) for this example of combinations?

I'm using Windows 7 64-bit, FX 8320, 16GB RAM.

like image 858
pshls Avatar asked Oct 31 '22 04:10

pshls


1 Answers

As @inscaven points out, the real time crunch comes from paste. If we simply had to generate all 17 choose 10 combinations 15000 times, that wouldn't take that long with the advent of a couple of highly optimized packages in R, arrangements and RcppAlgos (I am the author):

set.seed(101)
testMat <- matrix(sample(1000, 15000 * 17, TRUE), nrow = 15000)

library(arrangements)
system.time(lapply(1:15000, function(x) {
    temp <- combinations(x = testMat[x, ], k = 10)
    x
}))
  user  system elapsed 
 6.879   2.133   9.014

library(RcppAlgos)
system.time(lapply(1:15000, function(x) {
    temp <- comboGeneral(testMat[x, ], 10)
    x
}))
  user  system elapsed 
 5.770   2.178   7.953

Compared to combn loaded in base R:

system.time(lapply(1:15000, function(x) {
    temp <- combn(testMat[x, ], 10)
    x
}))
    user  system elapsed 
 261.163   1.093 262.608 

If we must combine our results into a matrix of characters, there isn't much more in base R that we can do. Even if we use either of the optimized libraries mentioned above, we are still left looping over all rows and pasting the results which is slow.

system.time(t1 <- lapply(1:50, function(x) {
    combn(testMat[x, ], 10, paste0, collapse = "")
}))
  user  system elapsed 
 6.847   0.070   6.933

## from package arrangements
system.time(t2 <- lapply(1:50, function(x) {
    apply(combinations(x = testMat[x, ], k = 10), 1, paste0, collapse = "")
}))
  user  system elapsed 
 6.318   0.032   6.353

This isn't really a win. We need a new approach.

Enter Rcpp

//[[Rcpp::export]]
CharacterVector pasteCombos(int n, int r, CharacterVector v, int numRows) {

    int r1 = r - 1, r2 = r - 2;
    int numIter, count = 0;
    CharacterVector comboVec = Rcpp::no_init_vector(numRows);

    std::vector<int> z(r);
    std::iota(z.begin(), z.end(), 0);

    while (count < numRows) {
        numIter = n - z[r1];
        if ((numIter + count) > numRows)
            numIter = numRows - count;

        for (int i = 0; i < numIter; ++i, ++count, ++z[r1])
            for (int k = 0; k < r; ++k)
                comboVec[count] += v[z[k]];

        for (int i = r2; i >= 0; i--) {
            if (z[i] != (n - r + i)) {
                ++z[i];
                for (int k = (i + 1); k < r; ++k) 
                    z[k] = z[k - 1] + 1;

                break;
            }
        }
    }

    return comboVec;
}

This function simply generates all combinations of v choose r and paste the results on the fly via +=. This generates a vector without the necessity of dealing with rows of a matrix. Let's see if we have any improvements.

numCombs <- choose(17, 10)
charMat <- matrix(as.character(testMat), nrow = 15000)

funOP <- function(z, r) {
    apply(X = combn(seq_len(ncol(z)), r), MAR = 2,FUN = function(jj) {apply(z[, jj], 1, paste, collapse="") })
}

system.time(t1 <- funOP(testMat[1:100, ], 10))
   user  system elapsed 
 22.221   0.110  22.330 

system.time(t2 <- lapply(1:100, function(x) {
     pasteCombos(17, 10, charMat[x,], numCombs)
}))
  user  system elapsed 
 7.890   0.085   7.975

Nearly 3 times faster... not bad, but we can do better.

Enter parallel

library(parallel)
system.time(t3 <- mclapply(1:100, function(x) {
    pasteCombos(17, 10, charMat[x,], numCombs)
}, mc.cores = 8)) ## you will have to adjust this on your computer.. I'm running MacOS with 8 cores
  user  system elapsed 
 1.430   0.454   1.912

Now we are talking!!! Nearly 12 times faster!!

Here is a sanity check:

all.equal(t1, do.call(rbind, t2))
# [1] TRUE
all.equal(t1, do.call(rbind, t3))
# [1] TRUE

In total, if we assume we can complete 100 rows in 2 seconds, we can complete our task in a respectable 2 * 150 = 300 seconds = 5 minutes.

like image 53
Joseph Wood Avatar answered Nov 15 '22 07:11

Joseph Wood