
Why is R dplyr::mutate inconsistent with custom functions




This question is a "why", not a how. In the following code I'm trying to understand why dplyr::mutate evaluates one custom function (f()) with the entire vector but not with the other custom function (g()). What exactly is mutate doing?

library(dplyr)

set.seed(1);sum(rnorm(100, c(0, 10, 100)))
f=function(m) {
    set.seed(1)  # reseed on every call so results are reproducible
    sum(rnorm(100, mean=m))
}
g <- function(m) sin(m)
df <- data.frame(a=c(0, 10, 100))
y1 <- mutate(df, asq=a^2, fout=f(a), gout=g(a))
y2 <- rowwise(df) %>%
    mutate(asq=a^2, fout=f(a), gout=g(a))
y3 <- group_by(df, a) %>%
    summarize(asq=a^2, fout=f(a), gout=g(a))

For all three columns, asq, fout, and gout, evaluation is rowwise in y2 and y3 and the results are identical. However, y1$fout is 3640.889 for all three rows, which is the result of evaluating sum(rnorm(100, c(0, 10, 100))). So the function f() is evaluating the entire vector for each row.

A closely related question has been asked elsewhere ("mutate/transform in R dplyr (Pass custom function)"), but the "why" was not explained.

Robert McDonald asked Apr 22 '18 15:04



2 Answers

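mutate does not loop over rows. On an ungrouped data frame it evaluates each expression once, passing the whole column a as a vector. sin and ^ are vectorized, so given a vector of three values they return three results, one per value. f is not vectorized: rnorm recycles the three means across its 100 draws and sum collapses everything to a single number, which mutate then recycles into every row. But you can do f = Vectorize(f) and it will operate on each individual value as well.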

y1 <- mutate(df, asq=a^2, fout=f(a), gout=g(a))
    a   asq     fout       gout
1   0     0 3640.889  0.0000000
2  10   100 3640.889 -0.5440211
3 100 10000 3640.889 -0.5063656
f = Vectorize(f)

y1a <- mutate(df, asq=a^2, fout=f(a), gout=g(a))
    a   asq        fout       gout
1   0     0    10.88874  0.0000000
2  10   100  1010.88874 -0.5440211
3 100 10000 10010.88874 -0.5063656
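The difference can be checked outside dplyr by comparing result lengths. The following is a minimal base R sketch, not part of the original answer; it assumes f reseeds with set.seed(1) on every call (an assumption made so the numbers are reproducible), and the names means_used and fv are introduced here purely for illustration.

```r
# Base R only; f and g are re-declared so the snippet stands alone.
f <- function(m) {
    set.seed(1)  # assumption: reseed per call so results are reproducible
    sum(rnorm(100, mean = m))
}
g <- function(m) sin(m)
a <- c(0, 10, 100)

length(g(a))   # 3: one result per element
length(f(a))   # 1: sum() collapses the 100 draws to a single number

# rnorm() recycles the three means over its 100 draws: 34 draws centred
# on 0, 33 on 10 and 33 on 100, so the total sits near 3630 plus noise.
means_used <- rep_len(a, 100)
table(means_used)
sum(means_used)   # 3630

# Vectorize() wraps f in a per-element loop, giving one result per value.
fv <- Vectorize(f)
length(fv(a))   # 3
```

Because f(a) has length 1, mutate has only one value to put in the column and repeats it on every row, which is why y1$fout is constant.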

Some additional info on vectorization here, here, and here.

eipi10 answered Sep 30 '22 18:09


We can loop through each element of 'a' using map_dbl (from purrr) and apply the function f to each one.

library(purrr)  # provides map_dbl()

df %>%
    mutate(asq = a^2, fout = map_dbl(a, f), gout = g(a))
#    a   asq        fout       gout
#1   0     0    10.88874  0.0000000
#2  10   100  1010.88874 -0.5440211
#3 100 10000 10010.88874 -0.5063656
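If purrr is not available, the same per-element loop can be written with base R's vapply. This is a hedged sketch rather than part of the original answer; it again assumes f reseeds with set.seed(1) on each call, and it fills the columns directly instead of going through mutate.

```r
# Base R alternative to map_dbl(); f is re-declared so the snippet stands alone.
f <- function(m) {
    set.seed(1)  # assumption: reseed per call so results are reproducible
    sum(rnorm(100, mean = m))
}
df <- data.frame(a = c(0, 10, 100))

# vapply() calls f once per element of df$a and checks that every call
# returns exactly one numeric value.
df$fout <- vapply(df$a, f, numeric(1))
df$gout <- sin(df$a)  # sin() is already vectorized, so no loop is needed
df
```

The numeric(1) template makes vapply stop with an error if f ever returns something other than a single number, which is a slightly stricter contract than sapply offers.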
akrun answered Sep 30 '22 18:09
