I just read the profile of @David Arenburg and found a bunch of useful tips on how to develop good R programming skills and habits. One especially struck me. I have always thought that the apply functions in R were the cornerstone of working with data frames, but he writes:
If you are working with data.frames, forget there is a function called apply- whatever you do - don't use it. Especially with a margin of 1 (the only good usecase for this function is to operate over matrix columns- margin of 2).
Some good alternatives: ?do.call, ?pmax/pmin, ?max.col, ?rowSums/rowMeans/etc, the awesome matrixStats packages (for matrices), ?rowsum and many more
Could anybody explain this to me? Why are apply functions frowned upon?
R has an interesting function called do.call. It lets you call any R function, but instead of writing out the arguments one by one, you pass them in a list. This may not seem useful on the surface, but it is powerful when the arguments are built programmatically.
lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X. do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.
The lapply() function takes a list, vector, or data frame as input and returns a list. Since lapply() applies an operation to every element of its input, it doesn't need a MARGIN argument.
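As a rough illustration of the two functions described above, the sketch below calls sum through do.call and then combines the list returned by lapply with do.call(rbind, ...). The names args, pieces and combined are made up for this example.

```r
# Hypothetical argument list, for illustration only
args <- list(1:10, na.rm = TRUE)

# These two calls are equivalent
direct   <- sum(1:10, na.rm = TRUE)
via_list <- do.call(sum, args)

# lapply applies a function to each element and always returns a list
pieces <- lapply(1:3, function(i) data.frame(id = i, sq = i^2))

# do.call passes each list element to rbind as a separate argument
combined <- do.call(rbind, pieces)
```

Here combined is a single data frame with one row per element of pieces, which is a common reason to pair lapply with do.call.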
apply(DF, 1, f) converts each row of DF to a vector and then passes that vector to f. If DF were a mix of strings and numbers, then the row would be converted to a character vector before passing it to f so that, for example, apply(iris, 1, function(x) sum(x[-5])) will not work even though the row iris[i, -5] contains all numeric elements. The row is converted to a character vector and you can't sum character strings. On the other hand, apply(iris[-5], 1, sum) will work the same as rowSums(iris[-5]).
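The coercion described above can be checked directly. This is a minimal sketch using the built-in iris data; the variable names m, res, by_apply and by_rowsums are only for illustration.

```r
# apply() converts a data frame with as.matrix(); because Species is a
# factor, every cell of the resulting matrix is a character string
m <- as.matrix(iris)
typeof(m)

# Summing character values fails, so capture the error message
# instead of stopping the script
res <- tryCatch(
  apply(iris, 1, function(x) sum(x[-5])),
  error = function(e) conditionMessage(e)
)

# Dropping the non-numeric column first keeps the matrix numeric
by_apply   <- apply(iris[-5], 1, sum)
by_rowsums <- rowSums(iris[-5])
all.equal(unname(by_apply), unname(by_rowsums))
```

Removing the factor column before the call, rather than inside f, is what keeps the matrix numeric.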
If f produces a vector, the result is a matrix and not another data frame; also, the result is the transpose of what you might expect. This call, apply(BOD, 1, identity), gives the following rather than giving BOD back:
       [,1] [,2] [,3] [,4] [,5] [,6]
Time    1.0  2.0    3    4  5.0  7.0
demand  8.3 10.3   19   16 15.6 19.8
Many years ago Hadley Wickham posted iapply, which is idempotent in the sense that iapply(mat, 1, identity) returns mat, rather than t(mat), where mat is a matrix. More recently, with his plyr package one can write:
library(plyr)
ddply(BOD, 1, identity)
and get BOD back as a data frame.
On the other hand, apply(BOD, 1, sum) will give the same result as rowSums(BOD), and apply(BOD, 1, f) might be useful for functions f that produce a scalar and have no counterpart such as in the sum / rowSums case. Also, if f produces a vector and you don't mind a matrix result, you can transpose the output of apply yourself; although ugly, it would work.
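The transpose workaround mentioned above might look like the following sketch. The row function f here is a made-up example that doubles each row; any f returning a vector of the same length would behave the same way.

```r
# A hypothetical row function that returns a vector (the row, doubled)
f <- function(x) x * 2

raw   <- apply(BOD, 1, f)  # one column per input row: 2 x 6
fixed <- t(raw)            # back to one row per input row: 6 x 2

dim(raw)
dim(fixed)

# Convert back to a data frame if one is needed
as_df <- as.data.frame(fixed)
```

The result is still built from a matrix, so this only makes sense when all columns share one type.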
I think what the author means is that, where you can, you should use pre-built/vectorized functions (because it is easier) and avoid apply (because in principle it is a for loop and takes longer):
library(microbenchmark)
d <- data.frame(a = rnorm(10, 10, 1),
                b = rnorm(10, 200, 1))
# bad - loop
microbenchmark(apply(d, 1, function(x) if (x[1] < x[2]) x[1] else x[2]))
# good - vectorized but same result
microbenchmark(pmin(d[[1]], d[[2]])) # use double brackets!
# edited:
# -------
# bad: lapply
microbenchmark(data.frame(lapply(d, round, 1)))
# good: do.call faster than lapply
microbenchmark(do.call("round", list(d, digits = 1)))
# --------------
# Unit: microseconds
# expr min lq mean median uq max neval
# do.call("round", list(d, digits = 1)) 104.422 107.1 148.3419 134.767 184.524 332.009 100
# expr min lq mean median uq max neval
# data.frame(lapply(d, round, 1)) 235.619 243.2055 298.5042 252.353 276.004 1550.265 100
#
# expr min lq mean median uq max neval
# do.call("round", list(d, digits = 1)) 96.389 97.5055 113.075 98.175 105.5375 730.954 100
# expr min lq mean median uq max neval
# data.frame(lapply(d, round, 1)) 235.619 243.2055 298.5042 252.353 276.004 1550.265 100
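The benchmark only compares speed, so it can be worth confirming that the "good" and "bad" versions really return the same values. The sketch below does that with base R only; the seed is an arbitrary choice to make the data reproducible, and the variable names are illustrative.

```r
set.seed(1)  # arbitrary seed, assumed here only for reproducibility
d <- data.frame(a = rnorm(10, 10, 1),
                b = rnorm(10, 200, 1))

# Row-wise minimum: loop-style apply versus vectorized pmin
row_min_apply <- apply(d, 1, function(x) if (x[1] < x[2]) x[1] else x[2])
row_min_pmin  <- pmin(d[[1]], d[[2]])
all.equal(unname(row_min_apply), row_min_pmin)

# Rounding every column: lapply versus a single call on the data frame
rounded_lapply <- data.frame(lapply(d, round, 1))
rounded_docall <- do.call("round", list(d, digits = 1))
all.equal(rounded_lapply, rounded_docall)
```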