What are the advantages of the "apply" functions? When are they better to use than "for" loops, and when are they not? [duplicate]

Possible Duplicate:
Is R's apply family more than syntactic sugar

Just what the title says. Stupid question, perhaps, but my understanding has been that when using an "apply" function, the iteration is performed in compiled code rather than in the R parser. This would seem to imply that lapply, for instance, is only faster than a "for" loop if there are a great many iterations and each operation is relatively simple. For instance, if a single call to a function wrapped up in lapply takes 10 seconds, and there are only, say, 12 iterations of it, I would imagine that there's virtually no difference at all between using "for" and "lapply".

Now that I think of it, if the function inside the "lapply" has to be parsed anyway, why should there be ANY performance benefit from using "lapply" instead of "for" unless you're doing something that there are compiled functions for (like summing or multiplying, etc)?

Thanks in advance!

Josh

Josh asked Jun 23 '11 21:06




2 Answers

There are several reasons why one might prefer an apply family function over a for loop, or vice-versa.

Firstly, for(), apply() and sapply() will generally be about as quick as each other if used correctly. lapply() does more of its work in compiled code within the R internals than the others, so it can be faster than those functions. The speed advantage appears to be greatest when the act of "looping" over the data is a significant part of the compute time; in many day-to-day uses you are unlikely to gain much from the inherently quicker lapply(). In the end, all of these call R functions, which need to be interpreted and then run.
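A rough way to check this on your own machine is to time both forms on a cheap function applied many times. This is only a sketch: the workload, the helper names square and loop_version, and the vector size are made up for illustration, and the timings will vary with hardware and R version.

```r
# Hypothetical workload: a trivial function called 100,000 times, so the
# cost of the looping itself is a noticeable share of the run time.
x <- runif(1e5)
square <- function(v) v * v

# Explicit loop with pre-allocated list storage.
loop_version <- function(x) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) out[[i]] <- square(x[i])
  out
}

t_for    <- system.time(res_for    <- loop_version(x))
t_lapply <- system.time(res_lapply <- lapply(x, square))

# Both approaches should produce the same list.
print(identical(res_for, res_lapply))

# Compare the "elapsed" column; expect the gap to be modest.
print(rbind(t_for, t_lapply))
```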

for() loops can often be easier to implement, especially if you come from a programming background where loops are prevalent. Working in a loop may be more natural than forcing the iterative computation into one of the apply family functions. However, to use for() loops properly, you need to do some extra work to set up storage and to manage plugging the output of the loop back together again. The apply functions do this for you automagically. E.g.:

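IN <- runif(10)
## allocate storage for the results before the loop starts
OUT <- logical(length = length(IN))
## loop over the indices of IN, not its values
for (i in seq_along(IN)) {
    OUT[i] <- IN[i] > 0.5
}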

That is a silly example, as > is a vectorised operator, but I wanted something to make a point, namely that you have to manage the output. The main thing is that with for() loops you should always allocate sufficient storage to hold the outputs before you start the loop. If you don't know how much storage you will need, allocate a reasonable chunk of storage, check inside the loop whether you have exhausted it, and if so bolt on another big chunk.
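The "allocate a chunk, then bolt on another" idea might look something like the following. It is a minimal sketch, not code from the answer: the stopping condition, the chunk size of 100 and the variable names are all invented for illustration.

```r
# Hypothetical task: collect random draws until one exceeds 0.99,
# without knowing in advance how many values will be kept.
set.seed(1)
chunk <- 100L
out <- numeric(chunk)   # initial block of storage
n <- 0L                 # number of slots actually filled

repeat {
  val <- runif(1)
  if (val > 0.99) break
  if (n == length(out)) {
    # Storage exhausted: extend by another whole chunk rather than
    # growing the vector one element at a time.
    length(out) <- length(out) + chunk
  }
  n <- n + 1L
  out[n] <- val
}

out <- out[seq_len(n)]  # drop the unused tail
```

Growing in chunks keeps the number of reallocations small compared with appending one element per iteration.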

The main reason, in my mind, for using one of the apply family of functions is for more elegant, readable code. Rather than managing the output storage and setting up the loop (as shown above) we can let R handle that and succinctly ask R to run a function on subsets of our data. Speed usually does not enter into the decision, for me at least. I use the function that suits the situation best and will result in simple, easy to understand code, because I'm far more likely to waste more time than I save by always choosing the fastest function if I can't remember what the code is doing a day or a week or more later!
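For comparison, here is the same threshold check written three ways: as an explicit loop over indices, with vapply(), and fully vectorised. The names OUT_loop, OUT_apply and OUT_vec are only illustrative.

```r
IN <- runif(10)

# Explicit loop: storage is allocated and filled by hand.
OUT_loop <- logical(length(IN))
for (i in seq_along(IN)) {
  OUT_loop[i] <- IN[i] > 0.5
}

# vapply(): R allocates and assembles the result; logical(1) states
# that every call must return a single logical value.
OUT_apply <- vapply(IN, function(x) x > 0.5, logical(1))

# Fully vectorised form, the natural choice for this particular case.
OUT_vec <- IN > 0.5

print(identical(OUT_loop, OUT_apply))
print(identical(OUT_loop, OUT_vec))
```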

The apply family lend themselves to scalar or vector operations. A for() loop will often lend itself to doing multiple iterated operations using the same index i. For example, I have written code that uses for() loops to do k-fold or bootstrap cross-validation on objects. I probably would never entertain doing that with one of the apply family as each CV iteration needs multiple operations, access to lots of objects in the current frame, and fills in several output objects that hold the output of the iterations.
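As an illustration of that kind of loop, here is a small k-fold cross-validation sketch. It is not the code referred to above: the simulated data, the linear model and the names fold, rmse, slope and held_out are assumptions chosen to keep the example short.

```r
# Simulated data for illustration only.
set.seed(42)
dat <- data.frame(x = rnorm(50))
dat$y <- 2 * dat$x + rnorm(50)

k <- 5
# Randomly assign each row to one of k folds.
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))

# Several output objects, all indexed by the same i.
rmse  <- numeric(k)
slope <- numeric(k)

for (i in seq_len(k)) {
  train    <- dat[fold != i, ]
  held_out <- dat[fold == i, ]

  fit  <- lm(y ~ x, data = train)
  pred <- predict(fit, newdata = held_out)

  rmse[i]  <- sqrt(mean((held_out$y - pred)^2))
  slope[i] <- coef(fit)[["x"]]
}

print(rmse)
print(slope)
```

Each iteration does several steps and fills more than one result vector, which is awkward to express as a single function passed to lapply().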

As to the last point, about why lapply() can possibly be faster than for() or apply(), you need to realise that the "loop" can be performed in interpreted R code or in compiled code. Yes, both will still be calling R functions that need to be interpreted, but if the looping and the function calls are driven directly from compiled C code (as in lapply()), that is where the performance gain can come from over, say, apply(), which boils down to a for() loop in actual R code. See the source for apply() to see that it is a wrapper around a for() loop, and then look at the code for lapply(), which is:

> lapply
function (X, FUN, ...) 
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X)) 
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<environment: namespace:base>

and you should see why there can be a difference in speed between lapply() and for() and the other apply family functions. .Internal() is one of R's ways of calling compiled C code used by R itself. Apart from coercing X to a list where necessary, and a sanity check on FUN, the entire computation is done in C, calling the R function FUN. Compare that with the source for apply().
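One way to make that comparison without reading the whole listing is to search the deparsed function bodies. This is a sketch that assumes a standard R installation where apply() and lapply() are the base versions; the variable names are illustrative.

```r
# Turn each function body into a character vector of source lines.
apply_src  <- deparse(body(apply))
lapply_src <- deparse(body(lapply))

# Does apply() contain an R-level for() loop?
uses_for <- any(grepl("for (", apply_src, fixed = TRUE))

# Does lapply() hand the iteration over to compiled code?
uses_internal <- any(grepl(".Internal(lapply", lapply_src, fixed = TRUE))

print(c(apply_has_for_loop = uses_for,
        lapply_calls_internal = uses_internal))
```

Typing apply at the prompt, without parentheses, prints the full source if you want to read it directly.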

Gavin Simpson answered Sep 29 '22 14:09



From Burns' R Inferno (pdf), p25:

Use an explicit for loop when each iteration is a non-trivial task. But a simple loop can be more clearly and compactly expressed using an apply function. There is at least one exception to this rule ... if the result will be a list and some of the components can be NULL, then a for loop is trouble (big trouble) and lapply gives the expected answer.
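A small sketch of the NULL exception mentioned in the quote. The function f and the result names are hypothetical; the point is that assigning NULL to a list element with [[<- removes that element instead of storing NULL.

```r
# Hypothetical function that returns NULL for even inputs.
f <- function(i) if (i %% 2 == 0) NULL else i

# for loop: pre-allocate a list of length 4 and fill it by index.
res_loop <- vector("list", 4)
for (i in 1:4) {
  # When f(i) is NULL this assignment deletes element i, shifting
  # later elements down and shortening the list.
  res_loop[[i]] <- f(i)
}

# lapply(): NULL results are kept as list components.
res_apply <- lapply(1:4, f)

print(length(res_loop))   # shorter than 4
print(length(res_apply))  # 4
```

If a for loop is unavoidable, writing res_loop[i] <- list(f(i)) stores the NULL rather than deleting the element.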

Richie Cotton answered Sep 29 '22 14:09
