For a two-variable problem, outer is most likely the best solution, and if the space to loop over is small enough then we can have expand.grid do our legwork. However, those are ruled out if we have more than two variables and a large space to loop over: outer can't handle more than two variables, and expand.grid eats up more memory than I've ever seen a machine be able to take.
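For reference, here is a minimal sketch of both two-variable approaches, with a placeholder foo() standing in for whatever function is actually being evaluated:

```r
# Placeholder function for illustration only.
foo <- function(a, b) a + b

n <- 4

# outer(): evaluates foo over the full a x b grid (foo must be vectorized).
m <- outer(1:n, 1:n, foo)

# expand.grid(): materialises every (a, b) pair as a data frame,
# which is what makes it so memory-hungry at large n.
g <- expand.grid(a = 1:n, b = 1:n)
g$res <- foo(g$a, g$b)
```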
I've recently found myself writing code like this:
n <- 1000
for (c in 1:n) {
  for (b in 1:c) {
    for (a in 1:b) {
      if (foo(a, b, c)) {  # foo() and bar() stand in for arbitrary user code
        bar(a, b, c)
      }
    }
  }
}
In these cases it does seem like a nested loop is the natural solution (e.g. mapply won't do, and there are no nice factors for tapply to use), but is there a better way? It seems like a path to bad code.
I suspect that combn might be able to do it somehow, but in my experience it doesn't take long for that to fall into the same memory trap as expand.grid. If memory serves, I've also known it to take the ill-advised step of telling me to change my global settings for recursion limits.
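For what it's worth, combn() can apply a function to each combination as it is generated, so only the per-combination results are stored rather than the full 3 x choose(n, 3) index matrix. Note, though, that it enumerates strict combinations (a < b < c), not the a <= b <= c triples of the loop above. A small sketch:

```r
# combn() applies FUN to each combination as it goes, so only the results
# are kept, not the full combination matrix.
# Note: strict combinations (a < b < c), unlike the loop's a <= b <= c.
sums <- combn(5L, 3L, FUN = sum)
length(sums)  # choose(5, 3) = 10
sums[1]       # first combination is (1, 2, 3), so 6
```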
Two nested loops to iterate over a 2-dimensional data structure, or three nested loops to iterate over a 3-dimensional data structure, is totally natural and nothing to avoid. What should be avoided is using nested loops for unrelated purposes in the same function - it makes the function harder to follow.
If you really need to iterate over them, you need to iterate over them. The general guidance against triply nested loops is to avoid spaghetti code that is hard to follow. If your inner and second loops are extracted into functions, this can improve code clarity:

void process_layer(layer_t layer) {
    for (x) {
        for (y) {
            do_something;
        }
    }
}
A nested loop in a database engine refers to a full table scan on the inner table of a join. They aren't always bad - in fact, they can be the most efficient way to answer a query if the inner table is very small and its base table storage fits on a single page - but for any table that is nontrivially sized, a nested loop will be slow.
Relatedly, for any complex construct - and a nested loop can be considered one - you should always ask whether it is the simplest possible solution, since there may be a missed alternative needing fewer loops.
This is combinations with repetition. RcppAlgos is likely your best out-of-the-box option, but at n = 1000L, that's about 167 million combinations to go through (just over 500 million values), which will take up ~2 GB of RAM.
library(RcppAlgos)
n = 1000L
mat <- comboGeneral(n, 3L, repetition = TRUE)
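The count and the memory figure can be sanity-checked in base R: multisets of size k drawn from n items number choose(n + k - 1, k) ("stars and bars"):

```r
n <- 1000L
k <- 3L
# Number of multisets (combinations with repetition) of size k from n items.
n_combos <- choose(n + k - 1, k)  # 167,167,000 combinations
# As a k-column integer matrix at 4 bytes per cell:
gib <- n_combos * k * 4 / 2^30    # ~1.9 GiB, i.e. the ~2 GB quoted above
```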
Now there are two routes to go. If you have the RAM and your function can be vectorized, you can do the above very quickly. Let's say that if the sum of the combination is greater than 1000 you want the mean of the combination, otherwise you want the sum.
res <- ifelse(rowSums(mat) > 1000L, rowMeans(mat), rowSums(mat))
## Error: cannot allocate vector of size 1.2 Gb
Oh no! I get the dreaded cannot-allocate-vector error. RcppAlgos also allows you to return the result of a function applied to each combination. But note that it returns a list and is much slower, because it has to evaluate your R function instead of staying in C++. Because of this, I changed to n = 100L because I do not have all day...
comboGeneral(100L, 3L, repetition = TRUE,
FUN = function(x) {
if (sum(x) > 100L)
mean(x)
else
sum(x)
}
)
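If RcppAlgos isn't available, the vectorized route can be emulated in base R for a small n: each strict 3-combination of 1..(n+2) maps one-to-one onto a multiset of 1..n by subtracting 0, 1, 2 from its sorted elements. The threshold 10 below is an arbitrary stand-in for the 1000 used above:

```r
n <- 5L
k <- 3L
# Bijection: each strict k-combination of 1..(n+k-1) becomes a multiset of
# 1..n (i.e. rows with a <= b <= c) after subtracting 0, 1, ..., k-1 columnwise.
mat <- t(combn(n + k - 1L, k)) -
  matrix(0:(k - 1L), nrow = choose(n + k - 1L, k), ncol = k, byrow = TRUE)
# Vectorized branch, analogous to the RcppAlgos example above.
res <- ifelse(rowSums(mat) > 10L, rowMeans(mat), rowSums(mat))
```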
If I had a static set where I was always choosing 3 combinations out of n, I would likely use Rcpp code directly, depending on what foo(a,b,c) and bar(a,b,c) are - but first I would like to know more about those functions.
My previous function lazyExpandGrid is not a perfect match, but I think it addresses your concern about memory exhaustion. Other languages have the prospect of a lazy iterator; R has it in the iterators package, and since I'm not proficient with it, some time ago I wrote this gist to scratch an itch.

One problem with lazyExpandGrid is that it expects the factors to be pre-defined. This can be handled with a quick condition, so it'll be memory-efficient though admittedly not fast. I don't think it'd be a quick fix to implement conditionals in the method itself, since its mechanism for lazily dealing with the expansion is knowing mathematically which index attaches to which combination of factors ... and conditions would bust that.
Here's how that function can work here:
n <- 3
it <- lazyExpandGrid(aa = 1:n, bb = 1:n, cc = 1:n)
while (length(thistask <- it$nextItem())) {
if (with(thistask, bb > aa || cc > bb)) next
print(jsonlite::toJSON(thistask))
}
# [{"aa":1,"bb":1,"cc":1}]
# [{"aa":2,"bb":1,"cc":1}]
# [{"aa":3,"bb":1,"cc":1}]
# [{"aa":2,"bb":2,"cc":1}]
# [{"aa":3,"bb":2,"cc":1}]
# [{"aa":3,"bb":3,"cc":1}]
# [{"aa":2,"bb":2,"cc":2}]
# [{"aa":3,"bb":2,"cc":2}]
# [{"aa":3,"bb":3,"cc":2}]
# [{"aa":3,"bb":3,"cc":3}]
### to demonstrate what an exhausted lazy-expansion looks like
it$nextItem()
# NULL
it$nextItem()
# NULL
(Note how the conditional with next skips those combinations.)
That would translate to your flow as:
n <- 1000
it <- lazyExpandGrid(aa = 1:n, bb = 1:n, cc = 1:n)
it
# lazyExpandGrid: 3 factors, 1e+09 rows
# $ index : 0
while (length(thistask <- it$nextItem())) {
if (with(thistask, bb > aa || cc > bb)) next
with(thistask, {
if (foo(aa, bb, cc)) bar(aa, bb, cc)
})
}
(Or without the with, using thistask$aa, etc.)
Note: I'm not going to lie - though this simplifies the flow, it does not make it fast. In this case, doing something 1e+09 times is going to take time, and I don't know of anything that will help with that besides parallel operations and perhaps a friendly cluster of R hosts. (I started running an empty no-op while loop as above, and it took 268 seconds to get through 822K iterations. I hope you have a lot of processing power.)
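One way to parallelize, assuming the work really can be split along the outermost index: give each worker a slice of the cc range and let it run the two inner loops itself, so no giant grid is ever materialised. Here foo() and bar() are hypothetical stand-ins (counting triples with an even sum) just to make the sketch runnable:

```r
library(parallel)

# Hypothetical stand-ins for foo()/bar(), for illustration only:
# count the triples a <= b <= c whose sum is even.
foo <- function(a, b, c) (a + b + c) %% 2 == 0
bar <- function(a, b, c) 1L

n <- 30
# Split by the outermost index; each worker runs the two inner loops.
counts <- mclapply(seq_len(n), function(cc) {
  total <- 0L
  for (bb in seq_len(cc)) {
    for (aa in seq_len(bb)) {
      if (foo(aa, bb, cc)) total <- total + bar(aa, bb, cc)
    }
  }
  total
}, mc.cores = if (.Platform$OS.type == "unix") 2L else 1L)  # forking is Unix-only
sum(unlist(counts))
```

For a genuinely huge n you would chunk seq_len(n) into far fewer, coarser tasks than one per cc value, to keep the scheduling overhead down.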