What is the idiomatic way to collect results in a loop in R if the number of final results is not known beforehand? Here's a toy example:
results = vector('integer')
i=1L
while (i < bigBigBIGNumber) {
if (someCondition(i)) results = c(results, i)
i = i+1
}
results
The problem with this example is that (I assume) it will have quadratic complexity as the vector needs to be re-allocated at every append. (Is this correct?) I'm looking for a solution that avoids this.
I found Filter
, but it requires pre-generating 1:bigBigBIGNumber
which I want to avoid to conserve memory. (Question: does for (i in 1:N)
also pre-generate 1:N
and keep it in memory?)
I could make something like a linked list like this:
results = list()
i=1L
while (i < bigBigBIGNumber) {
if (someCondition(i)) results = list(results, i)
i = i+1
}
unlist(results)
(Note that this is not concatenation. It's building a structure like list(list(list(1),2),3)
, then flattening with unlist
.)
Is there a better way than this? What is the idiomatic way that's typically used? (I am very new to R.) I'm looking for suggestion on how to tackle this type of problem. Suggestions both about compact (easy to write) and fast code are most welcome! (But I'd like to focus on fast and memory efficient.)
Here is an algorithm that doubles the size of the output list as it fills up, achieving somewhat linear computation times as show the benchmark tests:
test <- function(bigBigBIGNumber = 1000) {
n <- 10L
results <- vector("list", n)
m <- 0L
i <- 1L
while (i < bigBigBIGNumber) {
if (runif(1) > 0.5) {
m <- m + 1L
results[[m]] <- i
if (m == n) {
results <- c(results, vector("list", n))
n <- n * 2L
}
}
i = i + 1L
}
unlist(results)
}
system.time(test(1000))
# user system elapsed
# 0.008 0.000 0.008
system.time(test(10000))
# user system elapsed
# 0.090 0.002 0.093
system.time(test(100000))
# user system elapsed
# 0.885 0.051 0.936
system.time(test(1000000))
# user system elapsed
# 9.428 0.339 9.776
Presumably there's a maximum size you're willing to tolerate; pre-allocate and fill up to that level, then trim if necessary. This avoids the risk of not being able to satisfy the request to double in size, even when only a small additional amount of memory might be required; it fails early, and involves only one rather than log(n) re-allocations. Here's a function that takes a maximum size, a generating function, and a token that the generating function returns when there is nothing left to generate. We get up to n results before returning
filln <-
function(n, FUN, ..., RESULT_TYPE="numeric", DONE_TOKEN=NA_real_)
{
results <- vector(RESULT_TYPE, n)
i <- 0L
while (i < n) {
ans <- FUN(..., DONE_TOKEN=DONE_TOKEN)
if (identical(ans, DONE_TOKEN))
break
i <- i + 1L
results[[i]] <- ans
}
if (i == n)
warning("intolerably large result")
else length(results) <- i
results
}
Here's a generator
fun <- function(thresh, DONE_TOKEN) {
x <- rnorm(1)
if (x > thresh) DONE_TOKEN else x
}
and in action
> set.seed(123L); length(filln(10000, fun, 3))
[1] 163
> set.seed(123L); length(filln(10000, fun, 4))
[1] 10000
Warning message:
In filln(10000, fun, 4) : intolerably large result
> set.seed(123L); length(filln(100000, fun, 4))
[1] 23101
We can benchmark the overhead, approximately, by comparing to something that knows in advance how much space is required
f1 <- function(n, FUN, ...) {
i <- 0L
result <- numeric(n)
while (i < n) {
i <- i + 1L
result[i] <- FUN(...)
}
result
}
Here we check the timing and value of a single result
> set.seed(123L); system.time(res0 <- filln(100000, fun, 4))
user system elapsed
0.944 0.000 0.948
> set.seed(123L); system.time(res1 <- f1(23101, fun, 4))
user system elapsed
0.688 0.000 0.689
> identical(res0, res1)
[1] TRUE
which for this example is of course eclipsed by the simple vector solution(s)
set.seed(123L); system.time(res2 <- rnorm(23101))
identical(res0, res2)
If you can't compute 1:bigBigNumber
, count the entries, create the vector, then populate it.
num <- 0L
i <- 0L
while (i < bigBigNumber) {
if (someCondition(i)) num <- num + 1L
i <- i + 1L
}
result <- integer(num)
num <- 0L
while (i < bigBigNumber) {
if (someCondition(i)) {
result[num] <- i
num <- num + 1L }
i <- i + 1L
}
(This code is not tested.)
If you can compute 1:bigBigBIGNumber
, this will also work:
I assume that you want to call a function, and not simply tack on the indices themselves. Something like this may be closer to what you want:
values <- seq(bigBigBIGNumber)
sapply(values[someCondition(values)], my_function)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With