I'm using basic R for combinations.
For example, let's say I have a matrix with 2 rows and 5 columns:
z<-matrix(c(1, 2, 1, 3, 2, 2, 1, 3, 2, 1),nrow=2,ncol=5,byrow = TRUE)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 1 3 2
[2,] 2 1 3 2 1
I'm using the code below for combinations of sets of 3 from the 5 columns:
l<- apply(X = combn(seq_len(ncol(z)), 3),MAR = 2,FUN = function(jj) {apply(z[, jj], 1, paste, collapse="") })
This exports what I need:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "121" "123" "122" "113" "112" "132" "213" "212" "232" "132"
[2,] "213" "212" "211" "232" "231" "221" "132" "131" "121" "321"
The problem starts when I'm using big data in a matrix, for example when I have a matrix with 15000 rows and 17 columns and I need the combinations of sets of 10 from the 17 columns.
In this example this export takes very long time.
Is there any faster and more efficient way than basic R (maybe some packages or code,or using parallel cpu's) for this example of combinations?
I'm using Windows 7 64-bit, FX 8320, 16GB RAM.
As @inscaven points out, the real time crunch comes from paste
. If we simply had to generate all 17 choose 10 combinations 15000 times, that wouldn't take that long with the advent of a couple of highly optimized packages in R
, arrangements
and RcppAlgos
(I am the author):
set.seed(101)
testMat <- matrix(sample(1000, 15000 * 17, TRUE), nrow = 15000)
library(arrangements)
system.time(lapply(1:15000, function(x) {
temp <- combinations(x = testMat[x, ], k = 10)
x
}))
user system elapsed
6.879 2.133 9.014
library(RcppAlgos)
system.time(lapply(1:15000, function(x) {
temp <- comboGeneral(testMat[x, ], 10)
x
}))
user system elapsed
5.770 2.178 7.953
Compared to combn
loaded in base R
:
system.time(lapply(1:15000, function(x) {
temp <- combn(testMat[x, ], 10)
x
}))
user system elapsed
261.163 1.093 262.608
If we must combine our results into a matrix of characters, there isn't much more in base R
that we can do. Even if we use either of the optimized libraries mentioned above, we are still left looping over all rows and pasting the results which is slow.
system.time(t1 <- lapply(1:50, function(x) {
combn(testMat[x, ], 10, paste0, collapse = "")
}))
user system elapsed
6.847 0.070 6.933
## from package arrangements
system.time(t2 <- lapply(1:50, function(x) {
apply(combinations(x = testMat[x, ], k = 10), 1, paste0, collapse = "")
}))
user system elapsed
6.318 0.032 6.353
This isn't really a win. We need a new approach.
Enter Rcpp
//[[Rcpp::export]]
CharacterVector pasteCombos(int n, int r, CharacterVector v, int numRows) {
int r1 = r - 1, r2 = r - 2;
int numIter, count = 0;
CharacterVector comboVec = Rcpp::no_init_vector(numRows);
std::vector<int> z(r);
std::iota(z.begin(), z.end(), 0);
while (count < numRows) {
numIter = n - z[r1];
if ((numIter + count) > numRows)
numIter = numRows - count;
for (int i = 0; i < numIter; ++i, ++count, ++z[r1])
for (int k = 0; k < r; ++k)
comboVec[count] += v[z[k]];
for (int i = r2; i >= 0; i--) {
if (z[i] != (n - r + i)) {
++z[i];
for (int k = (i + 1); k < r; ++k)
z[k] = z[k - 1] + 1;
break;
}
}
}
return comboVec;
}
This function simply generates all combinations of v
choose r
and paste the results on the fly via +=
. This generates a vector without the necessity of dealing with rows of a matrix. Let's see if we have any improvements.
numCombs <- choose(17, 10)
charMat <- matrix(as.character(testMat), nrow = 15000)
funOP <- function(z, r) {
apply(X = combn(seq_len(ncol(z)), r), MAR = 2,FUN = function(jj) {apply(z[, jj], 1, paste, collapse="") })
}
system.time(t1 <- funOP(testMat[1:100, ], 10))
user system elapsed
22.221 0.110 22.330
system.time(t2 <- lapply(1:100, function(x) {
pasteCombos(17, 10, charMat[x,], numCombs)
}))
user system elapsed
7.890 0.085 7.975
Nearly 3 times faster... not bad, but we can do better.
Enter parallel
library(parallel)
system.time(t3 <- mclapply(1:100, function(x) {
pasteCombos(17, 10, charMat[x,], numCombs)
}, mc.cores = 8)) ## you will have to adjust this on your computer.. I'm running MacOS with 8 cores
user system elapsed
1.430 0.454 1.912
Now we are talking!!! Nearly 12 times faster!!
Here is a sanity check:
all.equal(t1, do.call(rbind, t2))
# [1] TRUE
all.equal(t1, do.call(rbind, t3))
# [1] TRUE
In total, if we assume we can complete 100 rows in 2 seconds, we can complete our task in a respectable 2 * 150 = 300 seconds = 5 minutes
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With