Collecting an unknown number of results in a loop

Tags: loops, append, r

What is the idiomatic way to collect results in a loop in R if the number of final results is not known beforehand? Here's a toy example:

results = vector('integer')   # start with an empty integer vector
i = 1L
while (i < bigBigBIGNumber)  {
    if (someCondition(i)) results = c(results, i)  # append: copies the whole vector
    i = i + 1L
}
results

The problem with this example is that (I assume) it will have quadratic complexity as the vector needs to be re-allocated at every append. (Is this correct?) I'm looking for a solution that avoids this.
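(As a rough check of this assumption, here is a throwaway helper I made up for illustration: timing the append loop at a few sizes shows clearly super-linear growth.)

# Illustrative micro-benchmark (grow_by_append is a made-up helper,
# not from the original post); each c(results, i) copies the vector.
grow_by_append <- function(n) {
  results <- integer(0)
  for (i in seq_len(n)) results <- c(results, i)
  results
}
for (n in c(1000L, 10000L, 100000L)) {
  cat(n, ":", system.time(grow_by_append(n))[["elapsed"]], "s\n")
}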

I found Filter, but it requires pre-generating 1:bigBigBIGNumber, which I want to avoid to conserve memory. (Question: does for (i in 1:N) also pre-generate 1:N and keep it in memory?)

I could make something like a linked list like this:

results = list()
i = 1L
while (i < bigBigBIGNumber)  {
    if (someCondition(i)) results = list(results, i)  # nest, rather than append
    i = i + 1L
}
unlist(results)

(Note that this is not concatenation. It's building a structure like list(list(list(1),2),3), then flattening with unlist.)
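For illustration (a small made-up example, not from the original post), here is the nesting pattern on three values and its flattened form:

# Tiny instance of the nested structure described above.
nested <- list(list(list(1L), 2L), 3L)
unlist(nested)
# [1] 1 2 3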

Is there a better way than this? What is the idiomatic way that's typically used? (I am very new to R.) I'm looking for suggestions on how to tackle this type of problem. Suggestions for both compact (easy to write) and fast code are most welcome! (But I'd like to focus on fast and memory-efficient.)

asked May 12 '13 by Szabolcs

3 Answers

Here is an algorithm that doubles the size of the output list as it fills up, achieving roughly linear computation times, as the benchmark tests below show:

test <- function(bigBigBIGNumber = 1000) {

  n <- 10L                     # current capacity of the list
  results <- vector("list", n)
  m <- 0L                      # number of results stored so far
  i <- 1L
  while (i < bigBigBIGNumber)  {
    if (runif(1) > 0.5) {      # stand-in for someCondition(i)
      m <- m + 1L
      results[[m]] <- i
      if (m == n) {            # full: double the capacity
        results <- c(results, vector("list", n))
        n <- n * 2L
      }
    }
    i <- i + 1L
  }
  unlist(results)
}

system.time(test(1000))
#    user  system elapsed 
#   0.008   0.000   0.008 
system.time(test(10000))
#    user  system elapsed 
#   0.090   0.002   0.093 
system.time(test(100000))
#    user  system elapsed 
#   0.885   0.051   0.936 
system.time(test(1000000))
#    user  system elapsed 
#   9.428   0.339   9.776 
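The same doubling idea (a sketch of mine, not part of this answer) can also be applied directly to an atomic integer vector by growing its length, which skips the list and the final unlist:

# Amortized doubling on an atomic vector; runif(1) > 0.5 again
# stands in for someCondition(i). Illustrative sketch only.
test_vec <- function(bigBigBIGNumber = 1000L) {
  n <- 10L                    # current capacity
  results <- integer(n)
  m <- 0L                     # number of elements stored so far
  i <- 1L
  while (i < bigBigBIGNumber) {
    if (runif(1) > 0.5) {
      m <- m + 1L
      results[m] <- i
      if (m == n) {           # full: double the capacity
        n <- n * 2L
        length(results) <- n  # extends with NAs in one reallocation
      }
    }
    i <- i + 1L
  }
  results[seq_len(m)]         # trim to the filled part
}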
answered Sep 23 '22 by flodel


Presumably there's a maximum size you're willing to tolerate; pre-allocate and fill up to that level, then trim if necessary. This avoids the risk of not being able to satisfy the request to double in size, even when only a small additional amount of memory might be required; it fails early, and involves only one rather than log(n) re-allocations. Here's a function that takes a maximum size, a generating function, and a token that the generating function returns when there is nothing left to generate; it collects up to n results before returning.

filln <-
    function(n, FUN, ..., RESULT_TYPE="numeric", DONE_TOKEN=NA_real_)
{
    results <- vector(RESULT_TYPE, n)    # pre-allocate to the maximum size
    i <- 0L
    while (i < n) {
        ans <- FUN(..., DONE_TOKEN=DONE_TOKEN)
        if (identical(ans, DONE_TOKEN))  # generator exhausted
            break
        i <- i + 1L
        results[[i]] <- ans
    }

    if (i == n)
        warning("intolerably large result")
    else length(results) <- i            # trim to the filled part
    results
}

Here's a generator:

fun <- function(thresh, DONE_TOKEN) {
    x <- rnorm(1)
    if (x > thresh) DONE_TOKEN else x
}

and in action:

> set.seed(123L); length(filln(10000, fun, 3))
[1] 163
> set.seed(123L); length(filln(10000, fun, 4))
[1] 10000
Warning message:
In filln(10000, fun, 4) : intolerably large result
> set.seed(123L); length(filln(100000, fun, 4))
[1] 23101

We can benchmark the overhead, approximately, by comparing to something that knows in advance how much space is required:

f1 <- function(n, FUN, ...) {
    i <- 0L
    result <- numeric(n)
    while (i < n) {
        i <- i + 1L
        result[i] <- FUN(...)
    }
    result
}

Here we check the timing and value of a single result:

> set.seed(123L); system.time(res0 <- filln(100000, fun, 4))
   user  system elapsed 
  0.944   0.000   0.948 
> set.seed(123L); system.time(res1 <- f1(23101, fun, 4))
   user  system elapsed 
  0.688   0.000   0.689 
> identical(res0, res1)
[1] TRUE

which for this example is of course eclipsed by the simple vector solution(s):

set.seed(123L); system.time(res2 <- rnorm(23101))
identical(res0, res2)
answered Sep 22 '22 by Martin Morgan


If you can't compute 1:bigBigBIGNumber, count the entries, create the vector, then populate it.

num <- 0L
i <- 0L
while (i < bigBigBIGNumber) {
  if (someCondition(i)) num <- num + 1L
  i <- i + 1L
}
result <- integer(num)  # allocate once, now that the count is known
num <- 0L
i <- 0L                 # reset the counter for the second pass
while (i < bigBigBIGNumber) {
  if (someCondition(i)) {
    num <- num + 1L     # increment first: R vectors are 1-indexed
    result[num] <- i
  }
  i <- i + 1L
}

(This code is not tested.)

If you can compute 1:bigBigBIGNumber, this will also work:

I assume that you want to call a function, and not simply tack on the indices themselves. Something like this may be closer to what you want:

values <- seq(bigBigBIGNumber)
sapply(values[someCondition(values)], my_function)
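As a concrete, self-contained illustration (the condition and function here are made up; note that someCondition must be vectorized for the subsetting to work):

# Hypothetical instance of the pattern above.
bigBigBIGNumber <- 100L
someCondition <- function(x) x %% 7L == 0L  # made-up vectorized condition
my_function   <- function(x) x^2            # made-up per-element function

values <- seq(bigBigBIGNumber)
sapply(values[someCondition(values)], my_function)
# [1]   49  196  441  784 1225 1764 2401 3136 3969 4900 5929 7056 8281 9604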
answered Sep 21 '22 by Matthew Lundberg