Possible Duplicate:
Is R's apply family more than syntactic sugar
Just what the title says. Stupid question, perhaps, but my understanding has been that when using an "apply" function, the iteration is performed in compiled code rather than in the R interpreter. This would seem to imply that lapply, for instance, is only faster than a "for" loop if there are a great many iterations and each operation is relatively simple. For instance, if a single call to a function wrapped up in lapply takes 10 seconds, and there are only, say, 12 iterations of it, I would imagine that there's virtually no difference at all between using "for" and "lapply".
Now that I think of it, if the function inside the "lapply" has to be interpreted anyway, why should there be ANY performance benefit from using "lapply" instead of "for" unless you're doing something that there are compiled functions for (like summing or multiplying, etc.)?
Thanks in advance!
Josh
Just as a loop is an embodiment of a piece of code we wish to have repeated, a function is an embodiment of a piece of code that we can run at any time just by calling it. A given loop construct, by contrast, can only be run at its present location in the source code.
The functions in the apply family let us manipulate data frames, arrays, matrices and vectors. They are an alternative to loops and can be somewhat more efficient, because some of them do their iteration in compiled code. They also reduce the need to write explicit loops in R.
The apply functions do run a loop in the background. However, they often do it in C (the language much of R itself is written in). This can make the apply functions slightly faster than regular for loops.
It is better to use one or more function calls within the loop if a loop is getting (too) big. The function calls will make it easier for other users to follow the code.
There are several reasons why one might prefer an apply family function over a for loop, or vice-versa.
Firstly, for(), apply() and sapply() will generally be just as quick as each other if executed correctly. lapply() does more of its work in compiled code within the R internals than the others, so it can be faster than those functions. It appears the speed advantage is greatest when the act of "looping" over the data is a significant part of the compute time; in many general day-to-day uses you are unlikely to gain much from the inherently quicker lapply(). In the end, these will all be calling R functions, so those functions need to be interpreted and then run.
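One way to check this on your own machine is to time the same element-wise task three ways. This is only a rough sketch: the vector size and the helper square_one() are made up for illustration, and the elapsed times depend on the machine and R version, so none are quoted here.

```r
# Hypothetical workload: square each element of a numeric vector.
n <- 1e5
x <- runif(n)
square_one <- function(v) v^2  # illustrative helper, not from the original post

# Explicit loop with pre-allocated output.
time_for <- system.time({
  out_for <- numeric(n)
  for (i in seq_len(n)) {
    out_for[i] <- square_one(x[i])
  }
})

# lapply() iterates in compiled code; unlist() flattens the resulting list.
time_lapply <- system.time({
  out_lapply <- unlist(lapply(x, square_one))
})

# sapply() is a wrapper around lapply() that simplifies the result.
time_sapply <- system.time({
  out_sapply <- sapply(x, square_one)
})

# All three approaches compute the same values.
stopifnot(isTRUE(all.equal(out_for, out_lapply)),
          isTRUE(all.equal(out_for, out_sapply)))

# Inspect the elapsed times rather than assuming which one wins.
print(c(`for`  = time_for[["elapsed"]],
        lapply = time_lapply[["elapsed"]],
        sapply = time_sapply[["elapsed"]]))
```

Because square_one() is trivial, the looping overhead is a large share of the total here; with an expensive function the three timings should be much closer.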
for() loops can often be easier to implement, especially if you come from a programming background where loops are prevalent. Working in a loop may be more natural than forcing the iterative computation into one of the apply family functions. However, to use for() loops properly, you need to do some extra work to set up storage and manage plugging the output of the loop back together again. The apply functions do this for you automagically. E.g.:
IN <- runif(10)
OUT <- logical(length = length(IN))  # pre-allocate the output
for (i in seq_along(IN)) {           # loop over indices, not values
    OUT[i] <- IN[i] > 0.5
}
That is a silly example, as > is a vectorised operator, but I wanted something to make a point, namely that you have to manage the output. The main thing is that with for() loops, you should always allocate sufficient storage to hold the outputs before you start the loop. If you don't know how much storage you will need, then allocate a reasonable chunk of storage, and then in the loop check whether you have exhausted that storage, and if so bolt on another big chunk.
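The chunked-allocation idea can be sketched roughly as follows. The task (drawing uniform numbers until their running total passes a target), the chunk size and all variable names are assumptions made up for illustration.

```r
# Hypothetical task: the number of results is not known before the loop runs.
set.seed(1)
chunk  <- 100             # assumed chunk size; tune for your problem
out    <- numeric(chunk)  # initial block of storage
n_used <- 0               # how many slots hold real results
total  <- 0

while (total < 250) {
  if (n_used == length(out)) {
    # Storage exhausted: bolt on another chunk rather than growing by one.
    out <- c(out, numeric(chunk))
  }
  n_used <- n_used + 1
  out[n_used] <- runif(1)
  total <- total + out[n_used]
}

# Trim the unused tail once the loop is done.
out <- out[seq_len(n_used)]
```

Growing in chunks means the vector is copied only once per chunk, not once per iteration.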
The main reason, in my mind, for using one of the apply family of functions is for more elegant, readable code. Rather than managing the output storage and setting up the loop (as shown above) we can let R handle that and succinctly ask R to run a function on subsets of our data. Speed usually does not enter into the decision, for me at least. I use the function that suits the situation best and will result in simple, easy to understand code, because I'm far more likely to waste more time than I save by always choosing the fastest function if I can't remember what the code is doing a day or a week or more later!
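For comparison, here is a sketch of the same thresholding task written with a loop, with sapply() and vapply(), and in vectorised form. The OUT_* names are introduced here only to keep the results apart.

```r
set.seed(42)
IN <- runif(10)

# Loop version: allocate the output, then fill it element by element.
OUT_loop <- logical(length(IN))
for (i in seq_along(IN)) {
  OUT_loop[i] <- IN[i] > 0.5
}

# apply-family versions: R allocates and assembles the result for us.
OUT_sapply <- sapply(IN, function(x) x > 0.5)
OUT_vapply <- vapply(IN, function(x) x > 0.5, logical(1))  # also checks each result's type

# The vectorised form, which is what you would really write for this task.
OUT_vec <- IN > 0.5

stopifnot(identical(OUT_loop, OUT_sapply),
          identical(OUT_loop, OUT_vapply),
          identical(OUT_loop, OUT_vec))
```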
The apply family lend themselves to scalar or vector operations. A for() loop will often lend itself to doing multiple iterated operations using the same index i. For example, I have written code that uses for() loops to do k-fold or bootstrap cross-validation on objects. I probably would never entertain doing that with one of the apply family, as each CV iteration needs multiple operations, access to lots of objects in the current frame, and fills in several output objects that hold the output of the iterations.
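A minimal sketch of that kind of loop, assuming a simulated data set and a simple linear model (none of which come from the code mentioned above; dat, fold, rmse, slope and n_test are hypothetical names):

```r
set.seed(123)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)

k <- 5
# Randomly assign each row to one of k folds.
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))

# Several output objects, all filled using the same index i.
rmse   <- numeric(k)
slope  <- numeric(k)
n_test <- integer(k)

for (i in seq_len(k)) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]

  fit  <- lm(y ~ x, data = train)        # fit on the training folds
  pred <- predict(fit, newdata = test)   # predict the held-out fold

  rmse[i]   <- sqrt(mean((test$y - pred)^2))
  slope[i]  <- coef(fit)[["x"]]
  n_test[i] <- nrow(test)
}

mean(rmse)  # cross-validated error estimate
```

Each iteration performs several steps and writes to three different outputs, which would be awkward to express as a single function passed to lapply().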
As to the last point, about why lapply() can possibly be faster than for() or apply(), you need to realise that the "loop" can be performed in interpreted R code or in compiled code. Yes, both will still be calling R functions that need to be interpreted, but if you are doing the looping and calling directly from compiled C code (e.g. lapply()) then that is where the performance gain can come from over, say, apply(), which boils down to a for() loop in actual R code. See the source for apply() to see that it is a wrapper around a for() loop, and then look at the code for lapply(), which is:
> lapply
function (X, FUN, ...)
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<environment: namespace:base>
and you should see why there can be a difference in speed between lapply() and for() and the other apply family functions. The .Internal() is one of R's ways of calling compiled C code used by R itself. Apart from an initial manipulation of X, and a sanity check on FUN, the entire computation is done in C, calling the R function FUN. Compare that with the source for apply().
From Burns' R Inferno (pdf), p25:
Use an explicit for loop when each iteration is a non-trivial task. But a simple loop can be more clearly and compactly expressed using an apply function. There is at least one exception to this rule ... if the result will be a list and some of the components can be NULL, then a for loop is trouble (big trouble) and lapply gives the expected answer.
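The NULL problem can be reproduced with a small sketch. The function maybe_null() is a made-up example, and the single-bracket workaround at the end is one common idiom, not something taken from the R Inferno.

```r
# Hypothetical function that returns NULL for even inputs.
maybe_null <- function(i) if (i %% 2 == 0) NULL else i

# for loop: assigning NULL with [[<- deletes the element instead of storing it.
out_for <- vector("list", 4)
for (i in 1:4) {
  out_for[[i]] <- maybe_null(i)
}

# lapply(): NULL results are kept as list components.
out_lapply <- lapply(1:4, maybe_null)

length(out_for)     # shorter than 4, because elements were deleted
length(out_lapply)  # one component per input

# If a loop is still wanted, wrapping the value in list() and using
# single-bracket assignment stores the NULL.
out_fixed <- vector("list", 4)
for (i in 1:4) {
  out_fixed[i] <- list(maybe_null(i))
}
stopifnot(identical(out_fixed, out_lapply))
```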