Assume I have a preallocated data structure that I want to write into, for performance reasons, rather than growing the data structure over time. First I tried this using sapply:
set.seed(1)
count <- 5
pre <- numeric(count)
sapply(1:count, function(i) {
pre[i] <- rnorm(1)
})
pre
# [1] 0 0 0 0 0
for(i in 1:count) {
pre[i] <- rnorm(1)
}
pre
# [1] -0.8204684 0.4874291 0.7383247 0.5757814 -0.3053884
I assume this is because the anonymous function in sapply is in a different scope (or is it environment in R?) and as a result the pre object isn't the same. The for loop exists in the same scope/environment and so it works as expected.
I've generally tried to adopt the R mechanisms for iteration with apply functions vs. for, but I don't see a way around it here. Is there something different I should be doing or a better idiom for this type of operation?
As noted, my example is highly contrived; I have no interest in generating normal deviates. My actual code is dealing with a 4-column, 1.5-million-row data frame. Previously I was relying on growing and merging to get a final data frame; based on benchmarking, I decided to try to avoid the merges and preallocate instead.
sapply isn't meant to be used like that. It already pre-allocates the result.
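For the example in the question, that means assigning sapply()'s return value rather than writing into pre from inside the function (a minimal sketch reusing the question's seed):
set.seed(1)
count <- 5
# no manual preallocation needed: sapply builds the result vector itself
pre <- sapply(1:count, function(i) rnorm(1))
pre
# [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078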
Regardless, the for loop is not likely the source of slow performance; it's probably because you're repeatedly subsetting a data.frame. For example:
set.seed(21)
N <- 1e4
d <- data.frame(n=1:N, s=sample(letters, N, TRUE), stringsAsFactors=FALSE)
l <- as.list(d)
set.seed(21)
system.time(for(i in 1:N) { d$n[i] <- rnorm(1); d$s[i] <- sample(letters,1) })
#    user  system elapsed
#    6.12    0.00    6.17
set.seed(21)
system.time(for(i in 1:N) { l$n[i] <- rnorm(1); l$s[i] <- sample(letters,1) })
#    user  system elapsed
#    0.14    0.00    0.14
D <- as.data.frame(l, stringsAsFactors=FALSE)
identical(d, D)
# [1] TRUE
So you should loop over individual vectors and combine them into a data.frame after the loop.
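In other words, something along these lines (a minimal sketch; the columns n and s and the per-row computations are placeholders for whatever your real loop does):
set.seed(21)
N <- 1e4
# preallocate plain vectors, one per column
n <- numeric(N)
s <- character(N)
for (i in 1:N) {
  n[i] <- rnorm(1)           # fill element i of each vector
  s[i] <- sample(letters, 1)
}
# build the data.frame once, after the loop
result <- data.frame(n = n, s = s, stringsAsFactors = FALSE)
str(result)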
The apply family isn't intended for side-effect producing tasks, like changing the state of a variable. These functions are meant to simply return values, which you then assign to a variable. This is consistent with the functional paradigm that R partially subscribes to. If you're using these functions as intended, preallocation doesn't come up much, and that's part of their appeal. You could easily do this without preallocating: p <- sapply(1:count, function(i) rnorm(1)). But this example is a little artificial---p <- rnorm(5) is what you would use.
If your actual problem is different from this and you're having problems with efficiency, look into vapply. It's just like sapply, but it lets you specify the type (and length) of each result, which makes it both safer and a bit faster. If that fails to help, check out the packages data.table or ff.
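For the toy example above, a vapply() call would look something like this (a sketch; the numeric(1) template is the only addition over the sapply() form):
set.seed(1)
count <- 5
# the third argument is a template: every call must return a value
# matching numeric(1), otherwise vapply() throws an error
pre <- vapply(1:count, function(i) rnorm(1), numeric(1))
pre
# same five values as the sapply() version, but type-checked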
Yes; you are essentially changing a copy of pre that is local to the anonymous function. That function returns the result of its last evaluation (a vector of length 1), so sapply() accumulates those individual length-1 vectors and returns the correct solution as a vector, but it never changes the pre in the global workspace.
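You can see this by capturing what sapply() returns from the original code (res is just an illustrative name):
set.seed(1)
count <- 5
pre <- numeric(count)
# each call's last evaluation is the assigned rnorm(1) value,
# so sapply collects the right numbers even though pre is untouched
res <- sapply(1:count, function(i) {
  pre[i] <- rnorm(1)
})
res
# [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
pre
# [1] 0 0 0 0 0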
You can work around this by using the <<- operator:
set.seed(1)
count <- 5
pre <- numeric(count)
sapply(1:count, function(i) {
pre[i] <<- rnorm(1)
})
pre
# [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
This does change pre, but I would avoid it: <<- gives the function side effects, which makes the code harder to reason about.
I don't think there is much to be gained from pre-allocating pre in the sapply() case anyway.
Also, for this example both are terribly inefficient; just get rnorm() to generate count random numbers. But I guess the example was just to illustrate the point?