
R: apply vs do.call

Tags: r, apply, do.call

I just read the profile of @David Arenburg and found a bunch of useful tips on developing good R programming skills and habits. One in particular struck me. I have always thought that the apply functions in R were the cornerstone of working with data frames, but he writes:

If you are working with data.frames, forget there is a function called apply- whatever you do - don't use it. Especially with a margin of 1 (the only good usecase for this function is to operate over matrix columns- margin of 2).

Some good alternatives: ?do.call, ?pmax/pmin, ?max.col, ?rowSums/rowMeans/etc, the awesome matrixStats packages (for matrices), ?rowsum and many more

Could anybody explain this to me? Why are apply functions frowned upon?

asked Jun 06 '18 09:06 by Helen


2 Answers

  • apply(DF, 1, f) converts each row of DF to a vector and then passes that vector to f. If DF is a mix of strings and numbers, each row is converted to a character vector before it is passed to f. For example, apply(iris, 1, function(x) sum(x[-5])) will not work even though the row iris[i, -5] contains only numeric elements: the row has already been converted to a character vector, and you can't sum character strings. On the other hand, apply(iris[-5], 1, sum) will give the same result as rowSums(iris[-5]).

  • If f produces a vector, the result is a matrix rather than another data frame; the result is also the transpose of what you might expect. This

    apply(BOD, 1, identity)
    

    gives the following rather than giving BOD back:

           [,1] [,2] [,3] [,4] [,5] [,6]
    Time    1.0  2.0    3    4  5.0  7.0
    demand  8.3 10.3   19   16 15.6 19.8
    

    Many years ago Hadley Wickham posted iapply, which is idempotent in the sense that iapply(mat, 1, identity) returns mat rather than t(mat), where mat is a matrix. More recently, with his plyr package, one can write:

    library(plyr)
    ddply(BOD, 1, identity)
    

    and get BOD back as a data frame.
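The coercion described in the first bullet can be checked directly. This is a minimal sketch using the built-in iris data set; the names row_mat, failed, and ok are illustrative only and do not come from the answer:

```r
# apply() calls as.matrix() on a data frame before looping over rows.
# With the factor column Species present, every value becomes a string.
row_mat <- as.matrix(iris)
is.character(row_mat)

# Summing the "numeric" part of each row therefore raises an error.
failed <- inherits(
  try(apply(iris, 1, function(x) sum(x[-5])), silent = TRUE),
  "try-error"
)

# Dropping the non-numeric column first keeps the matrix numeric,
# and the result agrees with rowSums() on the same columns.
ok <- apply(iris[-5], 1, sum)
all.equal(unname(ok), unname(rowSums(iris[-5])))
```

Selecting the numeric columns before calling apply avoids the problem, although rowSums expresses the same operation more directly.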

On the other hand, apply(BOD, 1, sum) gives the same result as rowSums(BOD), and apply(BOD, 1, f) can be useful when f produces a scalar and has no vectorized counterpart of the kind that rowSums is for sum. Also, if f produces a vector and you don't mind a matrix result, you can transpose the output of apply yourself; it is ugly, but it works.
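As a rough sketch of that last point, transposing the apply result and converting it back recovers the original row and column layout. The function f below is a hypothetical stand-in chosen for illustration, not something taken from the answer:

```r
# Hypothetical row-wise function that returns a vector (assumption for the demo).
f <- function(x) x * 2

# One column per row of BOD, so the shape is flipped relative to BOD.
res <- apply(BOD, 1, f)
dim(res)

# Transpose back and convert to a data frame to restore the layout.
fixed <- as.data.frame(t(res))
dim(fixed)
names(fixed)
```

The result has the same dimensions and column names as BOD, although any column types other than numeric would already have been lost in the matrix conversion.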

answered Oct 06 '22 01:10 by G. Grothendieck


I think what the author means is that you should use pre-built, vectorized functions whenever you can (they are easier to use) and avoid apply, which is in principle a for loop and therefore takes longer:

library(microbenchmark)

d <- data.frame(a = rnorm(10, 10, 1),
                b = rnorm(10, 200, 1))

# bad - loop
microbenchmark(apply(d, 1, function(x) if (x[1] < x[2]) x[1] else x[2]))

# good - vectorized but same result
microbenchmark(pmin(d[[1]], d[[2]])) # use double brackets!

# edited:
# -------
# bad: lapply
microbenchmark(data.frame(lapply(d, round, 1)))

# good: do.call faster than lapply
# (this builds the same call as round(d, digits = 1) on the whole data frame)
microbenchmark(do.call("round", list(d, digits = 1)))

# --------------
# Unit: microseconds
#                                  expr     min    lq     mean  median      uq     max neval
# do.call("round", list(d, digits = 1)) 104.422 107.1 148.3419 134.767 184.524 332.009   100
#                            expr     min       lq     mean  median      uq      max neval
# data.frame(lapply(d, round, 1)) 235.619 243.2055 298.5042 252.353 276.004 1550.265   100
#
#                                  expr    min      lq    mean median       uq     max neval
# do.call("round", list(d, digits = 1)) 96.389 97.5055 113.075 98.175 105.5375 730.954   100
#                            expr     min       lq     mean  median      uq      max neval
# data.frame(lapply(d, round, 1)) 235.619 243.2055 298.5042 252.353 276.004 1550.265   100
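
For readers unfamiliar with do.call, here is a small sketch of how it passes a list as the arguments of a function. Because a data frame is a list of columns, do.call(pmin, d2) computes a row-wise minimum across all columns at once. The data frame d2 and the variable names are made up for illustration:

```r
# Illustrative data, not taken from the benchmark above.
d2 <- data.frame(a = c(3, 1, 4), b = c(2, 7, 1), c = c(5, 0, 9))

# Each column becomes one argument, so this is the same call as
# pmin(d2$a, d2$b, d2$c).
row_min <- do.call(pmin, d2)

# The apply() version converts d2 to a matrix and loops over its rows.
row_min_apply <- apply(d2, 1, min)

row_min  # 2 0 1
```

This pattern works for any number of columns, which is presumably why do.call appears alongside pmax/pmin in the list of alternatives quoted in the question.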
answered Oct 06 '22 02:10 by r.user.05apr