I have a vector x of 1,344 unique strings. I want to generate a matrix that gives me all possible groups of three values, regardless of order, and export that to a csv.
I'm running R on EC2 on an m1.large instance with 64-bit Ubuntu. When using combn(x, 3) I get an out-of-memory error:
Error: cannot allocate vector of size 9.0 Gb
The resulting matrix has C(1344, 3) = 403,716,544 rows and three columns, which is the transpose of the result of the combn() function.
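That size roughly matches the error message, if I'm doing the arithmetic right (each element of a character matrix is an 8-byte pointer on a 64-bit build):

choose(1344, 3)             # 403716544 combinations
403716544 * 3 * 8 / 2^30    # ~9.0 Gb of pointers alone, before string storage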
I thought of using the bigmemory package to create a file-backed big.matrix, to which I could then assign the results of combn(). I can create a preallocated big matrix:
library(bigmemory)
x <- as.character(1:1344)
combos <- 403716544
test <- filebacked.big.matrix(nrow = combos, ncol = 3,
                              init = 0, backingfile = "test.matrix")
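One wrinkle I should mention (an assumption on my part, worth checking against the bigmemory docs) is that a big.matrix holds numeric data rather than character strings, so the file-backed matrix would presumably store integer positions into x and translate them back only when exporting, something like:

# Sketch, assuming the rows of 'test' have been filled with index triples
chunk <- test[1:1000, ]    # pull a block of index triples into a plain matrix
write.table(matrix(x[chunk], nrow = nrow(chunk)), "combos.csv", sep = ",",
            row.names = FALSE, col.names = FALSE, append = TRUE)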
But when I try to assign the values with test <- combn(x, 3), I still get the same error:
Error: cannot allocate vector of size 9.0 Gb
I even tried coercing the result of combn(x, 3) to a big.matrix, but I think that because combn() itself throws the error, as.big.matrix() never gets to run either:
test <- as.big.matrix(matrix(combn(x, 3)), backingfile = "abc")
Error: cannot allocate vector of size 9.0 Gb
Error in as.big.matrix(matrix(combn(x, 3)), backingfile = "abc") :
error in evaluating the argument 'x' in selecting a method for function 'as.big.matrix'
Is there a way to combine these two functions to get what I need? Are there other ways of achieving this? Thanks.
Here's a function I've written in R, which currently finds its (unexported) home in the LSPM package. You give it the total number of items n, the number of items to select r, and the index of the combination you want i; it returns the values in 1:n corresponding to combination i.
".combinadic" <- function(n, r, i) {
# http://msdn.microsoft.com/en-us/library/aa289166(VS.71).aspx
# http://en.wikipedia.org/wiki/Combinadic
if(i < 1 | i > choose(n,r)) stop("'i' must be 0 < i <= n!/(n-r)!")
largestV <- function(n, r, i) {
#v <- n-1
v <- n # Adjusted for one-based indexing
#while(choose(v,r) > i) v <- v-1
while(choose(v,r) >= i) v <- v-1 # Adjusted for one-based indexing
return(v)
}
res <- rep(NA,r)
for(j in 1:r) {
res[j] <- largestV(n,r,i)
i <- i-choose(res[j],r)
n <- res[j]
r <- r-1
}
res <- res + 1
return(res)
}
It allows you to generate each combination based on the value of the lexicographic index:
> .combinadic(1344, 3, 1)
[1] 3 2 1
> .combinadic(1344, 3, 2)
[1] 4 2 1
> .combinadic(1344, 3, 403716544)
[1] 1344 1343 1342
So you just need to loop over 1:403716544 and append the results to a file. It may take a while, but it's at least feasible (see Dirk's answer). You may also need to do it in several chunks, since a vector like 1:403716544 will not fit in memory on my machine.
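For example, here is a minimal sketch of that chunked loop (the helper name, chunk size, and output file name are my own picks for illustration, not part of the answer); it will be slow in plain R, but memory use stays small:

write_all_combos <- function(x, fname, chunk_size = 100000L) {
  n <- length(x)
  total <- choose(n, 3)
  start <- 1
  while (start <= total) {
    end <- min(start + chunk_size - 1, total)
    # One row of index values per combination in this block
    idx <- t(sapply(start:end, function(i) .combinadic(n, 3, i)))
    # Map the indices back to the original strings and append to the file
    write.table(matrix(x[idx], nrow = nrow(idx)), file = fname, sep = ",",
                row.names = FALSE, col.names = FALSE, append = TRUE)
    start <- end + 1
  }
}
# e.g. write_all_combos(x, "combos.csv")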
Or you could just port the R code to C/C++ and do the looping / writing there, since it would be a lot faster.
You could first find all 2-way combinations and then just combine them with the third value, writing the results to file as you go. This takes a lot less memory:
combn.mod <- function(x, fname) {
  tmp <- combn(x, 2, simplify = FALSE)
  n <- length(x)
  for (i in x[-c(n, n - 1)]) {
    # Drop all combinations that contain value i
    id <- which(!unlist(lapply(tmp, function(t) i %in% t)))
    tmp <- tmp[id]
    # Add i to all remaining combinations and write to file
    out <- do.call(rbind, lapply(tmp, c, i))
    write(t(out), file = fname, ncolumns = 3, append = TRUE, sep = ",")
  }
}
combn.mod(x, "F:/Tmp/Test.txt")
This is not as general as Joshua's answer, though; it is written specifically for your case. I suspect it is faster (again, for this particular case), but I didn't make the comparison. The function works on my computer using a little over 50 Mb (roughly estimated) when applied to your x.
EDIT
On a side note: if this is for simulation purposes, I find it hard to believe that any scientific application needs 400+ million simulation runs. You might be looking for the correct answer to the wrong question here...
PROOF OF CONCEPT:
I changed the write line to tt[[i]] <- out, added tt <- list() before the loop and return(tt) after it. Then:
> do.call(rbind,combn.mod(letters[1:5]))
[,1] [,2] [,3]
[1,] "b" "c" "a"
[2,] "b" "d" "a"
[3,] "b" "e" "a"
[4,] "c" "d" "a"
[5,] "c" "e" "a"
[6,] "d" "e" "a"
[7,] "c" "d" "b"
[8,] "c" "e" "b"
[9,] "d" "e" "b"
[10,] "d" "e" "c"
To a first approximation, every algorithm trades off storage for speed.
You have hit a boundary trying to preallocate the fully enumerated combination matrix, so maybe you should not preallocate it at all and instead try, say, one of the following:
- If you think you need the combinations, calculate them somewhere else and store them in a simple db (or, heck, a flat file) and look them up when needed; that's 9 Gb saved.
- Take advantage of open source: read the code to combn() and modify it into a client-server thingy where, given a call with index number N, it loops and returns the Nth entry. Not efficient, but possibly more easily feasible (see the sketch below).
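Here is a naive sketch of that last idea (the function name and the explicit triple loop are mine, not anything taken from combn()): it walks the combinations in lexicographic order and stops at entry N, so it is slow for large N but needs essentially no memory.

nth_combo3 <- function(x, N) {
  n <- length(x)
  count <- 0
  for (a in 1:(n - 2)) {
    for (b in (a + 1):(n - 1)) {
      for (c in (b + 1):n) {
        count <- count + 1
        if (count == N) return(x[c(a, b, c)])
      }
    }
  }
  stop("N exceeds choose(length(x), 3)")
}
nth_combo3(letters[1:5], 4)   # "a" "c" "d"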