How can I avoid writing nested for loops for large data sets?

Tags:

r

nested-loops

For a two-variable problem, outer is most likely the best solution, and if the space to loop over is small enough then expand.grid can do our legwork. However, both are ruled out once we have more than two variables and a large space to loop over: outer can't handle more than two variables, and expand.grid eats more memory than I've ever seen a machine take.
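To make the memory point concrete, here is a small sketch (base R; the function f is just a placeholder) of the two-variable outer approach, plus a back-of-the-envelope estimate of why expand.grid fails at three variables:

```r
# Two variables: outer() evaluates f over the full grid with no explicit loop.
f <- function(a, b) a + b
m <- outer(1:4, 1:4, f)      # 4 x 4 matrix, m[i, j] == i + j

# Three variables: expand.grid materialises every combination up front,
# so memory grows as n^3. For n = 1000 that is 1e9 rows of 3 integers
# (4 bytes each), roughly 11 GiB before any real work is done.
n <- 1000
rows <- n^3
approx_gib <- rows * 3 * 4 / 1024^3
```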

I've recently found myself writing code like this:

n <- 1000
for (c in 1:n) {
  for (b in 1:c) {
    for (a in 1:b) {
      if (foo(a, b, c)) {
        bar(a, b, c)
      }
    }
  }
}

In these cases a nested loop does seem like the natural solution (e.g. mapply won't do, and there are no nice factors for tapply to use), but is there a better way? It seems like a path to bad code.

I suspect that combn might be able to do it somehow, but in my experience it doesn't take long for it to fall into the same memory trap as expand.grid. If memory serves, I've also known it to take the ill-advised step of telling me to change my global settings for recursion limits.
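For reference, combn() can at least avoid materialising the combination matrix by applying a function to each combination as it is generated; only the results are kept. A small base-R sketch (note combn draws combinations *without* repetition, so it covers a < b < c rather than the question's a <= b <= c, and the result vector still grows as choose(n, 3)):

```r
# combn(x, m, FUN): FUN is applied to each m-combination as it is produced,
# so the full combination matrix is never stored -- only FUN's results.
res <- combn(5, 3, FUN = function(x) sum(x))
# choose(5, 3) == 10 results; the first combination is (1, 2, 3).
```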

asked Jun 05 '20 by J. Mini



2 Answers

This is combinations with repetition. RcppAlgos is likely your best out-of-the-box option, but at n = 1000L that is about 167 million combinations (over 500 million integers) to go through, which takes up ~2 GB of RAM.

library(RcppAlgos)
n = 1000L
mat <- comboGeneral(n, 3L, repetition = TRUE)

Now there are two routes to go. If you have the RAM and your function can be vectorized, you can process the matrix above very quickly. Say that if the sum of a combination is greater than 1000 you want its mean, otherwise you want its sum.

res <- ifelse(rowSums(mat) > 1000L, rowMeans(mat), rowSums(mat))

## Error: cannot allocate vector of size 1.2 Gb

Oh no! I get the dreaded allocation error. RcppAlgos also lets you apply a function to each combination as it is generated. But note that it then returns a list and is much slower, because it has to evaluate your R function instead of staying in C++. Because of this, I changed to n = 100L; I do not have all day...

comboGeneral(100L, 3L, repetition = TRUE,
             FUN = function(x) {
               if (sum(x) > 100L)
                 mean(x)
               else
                 sum(x)
             })
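A middle road between the two routes above, if the combinations really must all be visited, is to stream through them in fixed-size chunks: comboGeneral() accepts lower and upper bounds on the lexicographic index of the combinations it generates. A sketch (assumes RcppAlgos is installed; the sum/mean rule is the same toy example, and only the first two chunks are processed here as a demo):

```r
library(RcppAlgos)

n <- 1000L
total <- comboCount(n, 3L, repetition = TRUE)  # ~167 million
chunk <- 1e6
starts <- seq(1, total, by = chunk)

res <- numeric(0)
for (s in starts[1:2]) {  # loop over all of `starts` in real use
  # Generate only combinations with lexicographic index in [s, s + chunk - 1].
  m <- comboGeneral(n, 3L, repetition = TRUE,
                    lower = s, upper = min(s + chunk - 1, total))
  rs <- rowSums(m)
  res <- c(res, ifelse(rs > 1000L, rowMeans(m), rs))
}
```

Each chunk is a small integer matrix, so peak memory stays bounded regardless of the total count, at the cost of a little bookkeeping.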

If I had a static set where I was always choosing 3 combinations out of n, I would likely write Rcpp code directly, depending on what foo(a,b,c) and bar(a,b,c) are, but first I would want to know more about those functions.

answered Oct 23 '22 by Cole


My previous function lazyExpandGrid is not a perfect match, but I think it addresses your concern about memory exhaustion. Other languages offer lazy iterators; R has them in the iterators package, and since I'm not proficient with it, some time ago I wrote this gist to scratch an itch.

One problem with lazyExpandGrid is that it expects the factors to be pre-defined. This can be handled with a quick condition inside the loop, so it'll be memory-efficient, though admittedly not fast. I don't think conditionals would be a quick fix to implement inside the method itself, since its mechanism for dealing with the expansion lazily is knowing mathematically which index maps to which combination of factors ... and conditions would break that.

Here's how that function can work here:

n <- 3
it <- lazyExpandGrid(aa = 1:n, bb = 1:n, cc = 1:n)
while (length(thistask <- it$nextItem())) {
  if (with(thistask, bb > aa || cc > bb)) next
  print(jsonlite::toJSON(thistask))
}
# [{"aa":1,"bb":1,"cc":1}] 
# [{"aa":2,"bb":1,"cc":1}] 
# [{"aa":3,"bb":1,"cc":1}] 
# [{"aa":2,"bb":2,"cc":1}] 
# [{"aa":3,"bb":2,"cc":1}] 
# [{"aa":3,"bb":3,"cc":1}] 
# [{"aa":2,"bb":2,"cc":2}] 
# [{"aa":3,"bb":2,"cc":2}] 
# [{"aa":3,"bb":3,"cc":2}] 
# [{"aa":3,"bb":3,"cc":3}] 

### to demonstrate what an exhausted lazy-expansion looks like
it$nextItem()
# NULL
it$nextItem()
# NULL

(Note how the conditional with next skips those combinations.)

That would translate to your flow as:

n <- 1000
it <- lazyExpandGrid(aa = 1:n, bb = 1:n, cc = 1:n)
it
# lazyExpandGrid: 3 factors, 1e+09 rows
#   $ index : 0

while (length(thistask <- it$nextItem())) {
  if (with(thistask, bb > aa || cc > bb)) next
  with(thistask, {
    if (foo(aa, bb, cc)) bar(aa, bb, cc)
  })
}

(Or without the with, using thistask$aa, etc.)

Note: I'm not going to lie: though this simplifies the flow, it does not make it fast. Doing something 1e+09 times is going to take time, and I don't know of anything that will help with that besides parallel operations and perhaps a friendly cluster of R hosts. (I ran an empty no-op while loop as above, and it took 268 seconds to get through 822K iterations. I hope you have a lot of processing power.)
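If parallelism is the way forward, one simple way to spread the work across cores is to give each worker a slice of the outer index and run the two inner loops serially inside it. A sketch using the base parallel package (foo and bar here are placeholder stand-ins for the question's functions, and n is kept small for the demo; on Windows, use parLapply with a cluster instead of mclapply):

```r
library(parallel)

foo <- function(a, b, c) (a + b + c) %% 7 == 0  # placeholder predicate
bar <- function(a, b, c) a * b * c              # placeholder action

n <- 50
# One task per outer index; mclapply forks workers over them.
# (cc, not c, to avoid shadowing base::c.)
res <- mclapply(seq_len(n), function(cc) {
  out <- list()
  for (b in seq_len(cc))
    for (a in seq_len(b))
      if (foo(a, b, cc)) out[[length(out) + 1]] <- bar(a, b, cc)
  out
}, mc.cores = 2)
```

Because the work per outer index grows with cc, a dynamic scheduler (or pre-shuffling the indices) balances the load better than an even static split.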

answered Oct 23 '22 by r2evans