How can I avoid writing nested for loops for large data sets?

Tags:

r

nested-loops

For a two-variable problem, outer is most likely the best solution, and if the space to loop over is small enough then expand.grid can do our legwork. However, both are ruled out once we have more than two variables and a large space to loop over: outer can't handle more than two variables, and expand.grid eats more memory than I've ever seen a machine take.
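To make the memory point concrete, here is a small sketch (base R; the function f is just a placeholder) of the two-variable outer approach, plus a back-of-the-envelope estimate of why expand.grid fails at three variables:

```r
# Two variables: outer() evaluates f over the full grid with no explicit loop.
f <- function(a, b) a + b
m <- outer(1:4, 1:4, f)      # 4 x 4 matrix, m[i, j] == i + j

# Three variables: expand.grid materialises every combination up front,
# so memory grows as n^3. For n = 1000 that is 1e9 rows of 3 integers
# (4 bytes each), roughly 11 GiB before any real work is done.
n <- 1000
rows <- n^3
approx_gib <- rows * 3 * 4 / 1024^3
```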

I've recently found myself writing code like this:

n <- 1000
for (c in 1:n) {
  for (b in 1:c) {
    for (a in 1:b) {
      if (foo(a, b, c)) {
        bar(a, b, c)
      }
    }
  }
}

In these cases a nested loop does seem like the natural solution (e.g. mapply won't do, and there are no nice factors for tapply to use), but is there a better way? It seems like a path to bad code.

I suspect that combn might be able to do it somehow, but in my experience it doesn't take long for it to fall into the same memory trap as expand.grid. If memory serves, I've also known it to take the ill-advised step of telling me to change my global settings for recursion limits.
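For reference, combn() can at least avoid materialising the combination matrix by applying a function to each combination as it is generated; only the results are kept. A small base-R sketch (note combn draws combinations *without* repetition, so it covers a < b < c rather than the question's a <= b <= c, and the result vector still grows as choose(n, 3)):

```r
# combn(x, m, FUN): FUN is applied to each m-combination as it is produced,
# so the full combination matrix is never stored -- only FUN's results.
res <- combn(5, 3, FUN = function(x) sum(x))
# choose(5, 3) == 10 results; the first combination is (1, 2, 3).
```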

asked Jun 05 '20 by J. Mini



2 Answers

This is combinations with repetition. RcppAlgos is likely your best out-of-the-box option, but at n = 1000L that is about 167 million combinations (over 500 million integers) to go through, which takes up ~2 GB of RAM.

library(RcppAlgos)
n = 1000L
mat <- comboGeneral(n, 3L, repetition = TRUE)

Now there are two routes to go. If you have the RAM and your function can be vectorized, you can process the matrix above very quickly. Say that if the sum of a combination is greater than 1000 you want its mean, otherwise you want its sum.

res <- ifelse(rowSums(mat) > 1000L, rowMeans(mat), rowSums(mat))

## Error: cannot allocate vector of size 1.2 Gb

Oh no! I get the dreaded allocation error. RcppAlgos also lets you apply a function to each combination as it is generated. But note that it then returns a list and is much slower, because it has to evaluate your R function instead of staying in C++. Because of this, I changed to n = 100L; I do not have all day...

comboGeneral(100L, 3L, repetition = TRUE,
             FUN = function(x) {
               if (sum(x) > 100L)
                 mean(x)
               else
                 sum(x)
             })
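A middle road between the two routes above, if the combinations really must all be visited, is to stream through them in fixed-size chunks: comboGeneral() accepts lower and upper bounds on the lexicographic index of the combinations it generates. A sketch (assumes RcppAlgos is installed; the sum/mean rule is the same toy example, and only the first two chunks are processed here as a demo):

```r
library(RcppAlgos)

n <- 1000L
total <- comboCount(n, 3L, repetition = TRUE)  # ~167 million
chunk <- 1e6
starts <- seq(1, total, by = chunk)

res <- numeric(0)
for (s in starts[1:2]) {  # loop over all of `starts` in real use
  # Generate only combinations with lexicographic index in [s, s + chunk - 1].
  m <- comboGeneral(n, 3L, repetition = TRUE,
                    lower = s, upper = min(s + chunk - 1, total))
  rs <- rowSums(m)
  res <- c(res, ifelse(rs > 1000L, rowMeans(m), rs))
}
```

Each chunk is a small integer matrix, so peak memory stays bounded regardless of the total count, at the cost of a little bookkeeping.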

If I had a static set where I was always choosing 3 combinations out of n, I would likely write Rcpp code directly, depending on what foo(a,b,c) and bar(a,b,c) are, but first I would want to know more about those functions.

answered Oct 23 '22 by Cole


My previous function lazyExpandGrid is not a perfect match, but I think it addresses your concern about memory exhaustion. Other languages offer lazy iterators; R has them in the iterators package, and since I'm not proficient with it, some time ago I wrote this gist to scratch an itch.

One problem with lazyExpandGrid is that it expects the factors to be pre-defined. This can be handled with a quick condition inside the loop, so it'll be memory-efficient, though admittedly not fast. I don't think conditionals would be a quick fix to implement inside the method itself, since its mechanism for dealing with the expansion lazily is knowing mathematically which index maps to which combination of factors ... and conditions would break that.

Here's how that function can work here:

n <- 3
it <- lazyExpandGrid(aa = 1:n, bb = 1:n, cc = 1:n)
while (length(thistask <- it$nextItem())) {
  if (with(thistask, bb > aa || cc > bb)) next
  print(jsonlite::toJSON(thistask))
}
# [{"aa":1,"bb":1,"cc":1}] 
# [{"aa":2,"bb":1,"cc":1}] 
# [{"aa":3,"bb":1,"cc":1}] 
# [{"aa":2,"bb":2,"cc":1}] 
# [{"aa":3,"bb":2,"cc":1}] 
# [{"aa":3,"bb":3,"cc":1}] 
# [{"aa":2,"bb":2,"cc":2}] 
# [{"aa":3,"bb":2,"cc":2}] 
# [{"aa":3,"bb":3,"cc":2}] 
# [{"aa":3,"bb":3,"cc":3}] 

### to demonstrate what an exhausted lazy-expansion looks like
it$nextItem()
# NULL
it$nextItem()
# NULL

(Note how the conditional with next skips those combinations.)

That would translate to your flow as:

n <- 1000
it <- lazyExpandGrid(aa = 1:n, bb = 1:n, cc = 1:n)
it
# lazyExpandGrid: 3 factors, 1e+09 rows
#   $ index : 0

while (length(thistask <- it$nextItem())) {
  if (with(thistask, bb > aa || cc > bb)) next
  with(thistask, {
    if (foo(aa, bb, cc)) bar(aa, bb, cc)
  })
}

(Or without the with, using thistask$aa, etc.)

Note: I'm not going to lie: though this simplifies the flow, it does not make it fast. Doing something 1e+09 times is going to take time, and I don't know of anything that will help with that besides parallel operations and perhaps a friendly cluster of R hosts. (I ran an empty no-op while loop as above, and it took 268 seconds to get through 822K iterations. I hope you have a lot of processing power.)
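If parallelism is the way forward, one simple way to spread the work across cores is to give each worker a slice of the outer index and run the two inner loops serially inside it. A sketch using the base parallel package (foo and bar here are placeholder stand-ins for the question's functions, and n is kept small for the demo; on Windows, use parLapply with a cluster instead of mclapply):

```r
library(parallel)

foo <- function(a, b, c) (a + b + c) %% 7 == 0  # placeholder predicate
bar <- function(a, b, c) a * b * c              # placeholder action

n <- 50
# One task per outer index; mclapply forks workers over them.
# (cc, not c, to avoid shadowing base::c.)
res <- mclapply(seq_len(n), function(cc) {
  out <- list()
  for (b in seq_len(cc))
    for (a in seq_len(b))
      if (foo(a, b, cc)) out[[length(out) + 1]] <- bar(a, b, cc)
  out
}, mc.cores = 2)
```

Because the work per outer index grows with cc, a dynamic scheduler (or pre-shuffling the indices) balances the load better than an even static split.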

answered Oct 23 '22 by r2evans