Assume I have a preallocated data structure that I want to write into, for performance reasons, rather than growing the data structure over time. First I tried this using sapply:
set.seed(1)
count <- 5
pre <- numeric(count)
sapply(1:count, function(i) {
  pre[i] <- rnorm(1)
})
pre
# [1] 0 0 0 0 0
Then I tried a for loop:
for(i in 1:count) {
  pre[i] <- rnorm(1)
}
pre
# [1] -0.8204684 0.4874291 0.7383247 0.5757814 -0.3053884
I assume this is because the anonymous function in sapply is in a different scope (or is it an environment, in R terms?) and, as a result, the pre it modifies isn't the same object. The for loop runs in the same scope/environment, so it works as expected.
I've generally tried to adopt R's apply functions for iteration rather than for loops, but I don't see a way around it here. Is there something different I should be doing, or a better idiom for this type of operation?
As noted, my example is highly contrived; I have no interest in generating normal deviates. My actual code deals with a 4-column, 1.5-million-row data.frame. Previously I was relying on growing and merging to get a final data.frame, and after benchmarking I decided to avoid the merges and preallocate instead.
sapply isn't meant to be used like that. It already pre-allocates the result.
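In your contrived example, its intended use would just be to capture the value it returns; a minimal sketch (output shown for set.seed(1)):
set.seed(1)
count <- 5
# sapply builds and fills the result vector itself; simply assign its return value
pre <- sapply(1:count, function(i) rnorm(1))
pre
# [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078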
Regardless, the for loop is not likely the source of slow performance; it's probably because you're repeatedly subsetting a data.frame. For example:
set.seed(21)
N <- 1e4
d <- data.frame(n=1:N, s=sample(letters, N, TRUE), stringsAsFactors=FALSE)
l <- as.list(d)
set.seed(21)
system.time(for(i in 1:N) { d$n[i] <- rnorm(1); d$s[i] <- sample(letters,1) })
# user system elapsed
# 6.12 0.00 6.17
set.seed(21)
system.time(for(i in 1:N) { l$n[i] <- rnorm(1); l$s[i] <- sample(letters,1) })
# user system elapsed
# 0.14 0.00 0.14
D <- as.data.frame(l, stringsAsFactors=FALSE)
identical(d,D)
# [1] TRUE
So you should loop over individual vectors and combine them into a data.frame after the loop.
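For instance, a minimal sketch of that pattern, reusing the n/s columns and N from the example above:
set.seed(21)
N <- 1e4
# preallocate plain vectors and fill them element-by-element in the loop
n <- numeric(N)
s <- character(N)
for(i in 1:N) {
  n[i] <- rnorm(1)
  s[i] <- sample(letters, 1)
}
# build the data.frame only once, after the loop
result <- data.frame(n=n, s=s, stringsAsFactors=FALSE)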
The apply family isn't intended for side-effect-producing tasks, like changing the state of a variable. These functions are meant to simply return values, which you then assign to a variable. This is consistent with the functional paradigm that R partially subscribes to. If you're using these functions as intended, preallocation doesn't come up much, and that's part of their appeal. You could easily do this without preallocating: p <- sapply(1:count, function(i) rnorm(1)). But this example is a little artificial; p <- rnorm(5) is what you would use.
If your actual problem is different than this and you're having problems with efficiency, look into vapply. It's just like sapply, but allows you to specify the resulting data type, giving it a speed advantage. If that fails to help, check out the data.table or ff packages.
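For the toy example in the question, a vapply call would look something like this (the third argument, numeric(1), declares that each element of the result is a single numeric value):
set.seed(1)
count <- 5
# vapply is like sapply, but with a declared result type that is checked as the output is filled
p <- vapply(1:count, function(i) rnorm(1), numeric(1))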
Yes, you are essentially changing a pre that is local to the anonymous function. The anonymous function returns the result of its last evaluation (a vector of length 1), hence sapply() returns the correct solution as a vector (it accumulates the individual length-1 vectors), but it doesn't change the pre in the global workspace.
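A minimal sketch (the function f below is purely illustrative) of what is going on: the subassignment makes a local copy of pre inside the function's own environment, and the global pre is left untouched:
pre <- numeric(5)
f <- function() {
  pre[1] <- 99  # copies the global pre into f's environment, then modifies that copy
  pre           # returns the local copy
}
f()
# [1] 99  0  0  0  0
pre
# [1] 0 0 0 0 0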
You can work around this by using the <<- operator:
set.seed(1)
count <- 5
pre <- numeric(count)
sapply(1:count, function(i) {
  pre[i] <<- rnorm(1)
})
pre
# [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
This has changed pre, but I would avoid doing it for various reasons, not least that such side effects make code harder to reason about.
I don't think there is much to be gained from pre-allocating pre in the sapply() case anyway.
Also, for this example both versions are terribly inefficient; just get rnorm() to generate count random numbers in one call. But I guess the example was just to illustrate the point?