Collecting an unknown number of results in a loop

Tags: loops, append, r

What is the idiomatic way to collect results in a loop in R if the number of final results is not known beforehand? Here's a toy example:

results = vector('integer')   # start with an empty integer vector
i = 1L
while (i < bigBigBIGNumber)  {
    if (someCondition(i)) results = c(results, i)  # append: copies the whole vector
    i = i + 1L
}
results

The problem with this example is that (I assume) it will have quadratic complexity as the vector needs to be re-allocated at every append. (Is this correct?) I'm looking for a solution that avoids this.
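(As a rough check of this assumption, here is a throwaway helper I made up for illustration: timing the append loop at a few sizes shows clearly super-linear growth.)

# Illustrative micro-benchmark (grow_by_append is a made-up helper,
# not from the original post); each c(results, i) copies the vector.
grow_by_append <- function(n) {
  results <- integer(0)
  for (i in seq_len(n)) results <- c(results, i)
  results
}
for (n in c(1000L, 10000L, 100000L)) {
  cat(n, ":", system.time(grow_by_append(n))[["elapsed"]], "s\n")
}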

I found Filter, but it requires pre-generating 1:bigBigBIGNumber, which I want to avoid to conserve memory. (Question: does for (i in 1:N) also pre-generate 1:N and keep it in memory?)

I could make something like a linked list like this:

results = list()
i = 1L
while (i < bigBigBIGNumber)  {
    if (someCondition(i)) results = list(results, i)  # nest, rather than append
    i = i + 1L
}
unlist(results)

(Note that this is not concatenation. It's building a structure like list(list(list(1),2),3), then flattening with unlist.)
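For illustration (a small made-up example, not from the original post), here is the nesting pattern on three values and its flattened form:

# Tiny instance of the nested structure described above.
nested <- list(list(list(1L), 2L), 3L)
unlist(nested)
# [1] 1 2 3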

Is there a better way than this? What is the idiomatic way that's typically used? (I am very new to R.) I'm looking for suggestions on how to tackle this type of problem. Suggestions for both compact (easy to write) and fast code are most welcome! (But I'd like to focus on fast and memory-efficient.)

asked May 12 '13 by Szabolcs

3 Answers

Here is an algorithm that doubles the size of the output list as it fills up, achieving roughly linear computation times, as the benchmark tests below show:

test <- function(bigBigBIGNumber = 1000) {

  n <- 10L                     # current capacity of the list
  results <- vector("list", n)
  m <- 0L                      # number of results stored so far
  i <- 1L
  while (i < bigBigBIGNumber)  {
    if (runif(1) > 0.5) {      # stand-in for someCondition(i)
      m <- m + 1L
      results[[m]] <- i
      if (m == n) {            # full: double the capacity
        results <- c(results, vector("list", n))
        n <- n * 2L
      }
    }
    i <- i + 1L
  }
  unlist(results)
}

system.time(test(1000))
#    user  system elapsed 
#   0.008   0.000   0.008 
system.time(test(10000))
#    user  system elapsed 
#   0.090   0.002   0.093 
system.time(test(100000))
#    user  system elapsed 
#   0.885   0.051   0.936 
system.time(test(1000000))
#    user  system elapsed 
#   9.428   0.339   9.776 
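The same doubling idea (a sketch of mine, not part of this answer) can also be applied directly to an atomic integer vector by growing its length, which skips the list and the final unlist:

# Amortized doubling on an atomic vector; runif(1) > 0.5 again
# stands in for someCondition(i). Illustrative sketch only.
test_vec <- function(bigBigBIGNumber = 1000L) {
  n <- 10L                    # current capacity
  results <- integer(n)
  m <- 0L                     # number of elements stored so far
  i <- 1L
  while (i < bigBigBIGNumber) {
    if (runif(1) > 0.5) {
      m <- m + 1L
      results[m] <- i
      if (m == n) {           # full: double the capacity
        n <- n * 2L
        length(results) <- n  # extends with NAs in one reallocation
      }
    }
    i <- i + 1L
  }
  results[seq_len(m)]         # trim to the filled part
}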
answered Sep 23 '22 by flodel


Presumably there's a maximum size you're willing to tolerate; pre-allocate and fill up to that level, then trim if necessary. This avoids the risk of not being able to satisfy the request to double in size, even when only a small additional amount of memory might be required; it fails early, and involves only one rather than log(n) re-allocations. Here's a function that takes a maximum size, a generating function, and a token that the generating function returns when there is nothing left to generate; it collects up to n results before returning.

filln <-
    function(n, FUN, ..., RESULT_TYPE="numeric", DONE_TOKEN=NA_real_)
{
    results <- vector(RESULT_TYPE, n)    # pre-allocate to the maximum size
    i <- 0L
    while (i < n) {
        ans <- FUN(..., DONE_TOKEN=DONE_TOKEN)
        if (identical(ans, DONE_TOKEN))  # generator exhausted
            break
        i <- i + 1L
        results[[i]] <- ans
    }

    if (i == n)
        warning("intolerably large result")
    else length(results) <- i            # trim to the filled part
    results
}

Here's a generator:

fun <- function(thresh, DONE_TOKEN) {
    x <- rnorm(1)
    if (x > thresh) DONE_TOKEN else x
}

and in action:

> set.seed(123L); length(filln(10000, fun, 3))
[1] 163
> set.seed(123L); length(filln(10000, fun, 4))
[1] 10000
Warning message:
In filln(10000, fun, 4) : intolerably large result
> set.seed(123L); length(filln(100000, fun, 4))
[1] 23101

We can benchmark the overhead, approximately, by comparing to something that knows in advance how much space is required:

f1 <- function(n, FUN, ...) {
    i <- 0L
    result <- numeric(n)
    while (i < n) {
        i <- i + 1L
        result[i] <- FUN(...)
    }
    result
}

Here we check the timing and value of a single result:

> set.seed(123L); system.time(res0 <- filln(100000, fun, 4))
   user  system elapsed 
  0.944   0.000   0.948 
> set.seed(123L); system.time(res1 <- f1(23101, fun, 4))
   user  system elapsed 
  0.688   0.000   0.689 
> identical(res0, res1)
[1] TRUE

which for this example is of course eclipsed by the simple vector solution(s):

set.seed(123L); system.time(res2 <- rnorm(23101))
identical(res0, res2)
answered Sep 22 '22 by Martin Morgan


If you can't compute 1:bigBigBIGNumber, count the entries, create the vector, then populate it.

num <- 0L
i <- 0L
while (i < bigBigBIGNumber) {
  if (someCondition(i)) num <- num + 1L
  i <- i + 1L
}
result <- integer(num)  # allocate once, now that the count is known
num <- 0L
i <- 0L                 # reset the counter for the second pass
while (i < bigBigBIGNumber) {
  if (someCondition(i)) {
    num <- num + 1L     # increment first: R vectors are 1-indexed
    result[num] <- i
  }
  i <- i + 1L
}

(This code is not tested.)

If you can compute 1:bigBigBIGNumber, this will also work:

I assume that you want to call a function, and not simply tack on the indices themselves. Something like this may be closer to what you want:

values <- seq(bigBigBIGNumber)
sapply(values[someCondition(values)], my_function)
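As a concrete, self-contained illustration (the condition and function here are made up; note that someCondition must be vectorized for the subsetting to work):

# Hypothetical instance of the pattern above.
bigBigBIGNumber <- 100L
someCondition <- function(x) x %% 7L == 0L  # made-up vectorized condition
my_function   <- function(x) x^2            # made-up per-element function

values <- seq(bigBigBIGNumber)
sapply(values[someCondition(values)], my_function)
# [1]   49  196  441  784 1225 1764 2401 3136 3969 4900 5929 7056 8281 9604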
answered Sep 21 '22 by Matthew Lundberg