At one stage in a longer chain of dplyr functions, I need to replace parts of a variable, using numeric indices to specify which elements to replace.
My data looks like this:
df1 <- data.frame(grp = rep(1:2, each = 3),
                  a = 1:6,
                  b = rep(c(10, 20), each = 3))
df1
#   grp a  b
# 1   1 1 10
# 2   1 2 10
# 3   1 3 10
# 4   2 4 20
# 5   2 5 20
# 6   2 6 20
Assume that we, within each group, wish to replace elements in variable a with the corresponding elements in b, at one or more positions. In this simple example I use a single index (id), but this could be a vector of indices. First, here's how I would do it with ddply:
library(plyr)
id <- 2
ddply(.data = df1, .variables = .(grp), function(x){
  x$a[id] <- x$b[id]
  x
})
#   grp  a  b
# 1   1  1 10
# 2   1 10 10
# 3   1  3 10
# 4   2  4 20
# 5   2 20 20
# 6   2  6 20
In dplyr I could think of a few different ways to perform the replacement. (1) Use do with an anonymous function, similar to the one used in ddply. (2) Use mutate: concatenate a vector where the replacement is 'inserted' using numeric indexing. This is probably only practical for a single index. (3) Use mutate: create an index vector and use conditional replacement with ifelse (an approach suggested in several related posts).
detach("package:plyr", unload = TRUE)
library(dplyr)
# (1) do() with an anonymous function, then bind the pieces together
fun_do <- function(df){
  l <- df %.%
    group_by(grp) %.%
    do(function(dat){
      dat$a[id] <- dat$b[id]
      dat
    })
  do.call(rbind, l)
}

# (2) 'jigsaw puzzle insertion': rebuild a around the replaced element
fun_mut <- function(df){
  df %.%
    group_by(grp) %.%
    mutate(
      a = c(a[1:(id - 1)], b[id], a[(id + 1):length(a)])
    )
}

# (3) within-group index column plus ifelse
fun_mut_ifelse <- function(df){
  df %.%
    group_by(grp) %.%
    mutate(
      idx = 1:n(),
      a = ifelse(idx %in% id, b, a)) %.%
    select(-idx)
}
fun_do(df1)
fun_mut(df1)
fun_mut_ifelse(df1)
In a benchmark with a slightly larger data set, the 'jigsaw puzzle insertion' is fastest, but again, this method is probably only suited for single replacements. And it doesn't look very clean...
set.seed(123)
df2 <- data.frame(grp = rep(1:200, each = 3),
                  a = rnorm(600),
                  b = rnorm(600))
library(microbenchmark)
microbenchmark(fun_do(df2),
               fun_mut(df2),
               fun_mut_ifelse(df2),
               times = 10)
# Unit: microseconds
#                 expr       min        lq    median        uq       max neval
#          fun_do(df2) 48443.075 49912.682 51356.631 53369.644 55108.769    10
#         fun_mut(df2)   891.420   933.996  1019.906  1066.663  1155.235    10
#  fun_mut_ifelse(df2)  2503.579  2667.798  2869.270  3027.407  3138.787    10
Just to check the influence of the do.call(rbind, l) step in fun_do, try a version without it:
fun_do2 <- function(df){
  df %.%
    group_by(grp) %.%
    do(function(dat){
      dat$a[2] <- dat$b[2]
      dat
    })
}
fun_do2(df1)
Then a new benchmark on a larger data set:
df3 <- data.frame(grp = rep(1:2000, each = 3),
                  a = rnorm(6000),
                  b = rnorm(6000))
microbenchmark(fun_do(df3),
               fun_do2(df3),
               fun_mut(df3),
               fun_mut_ifelse(df3),
               times = 10)
Again, a simple 'insertion' is fastest, while the do function is losing ground. In the help text, do is described as "a general purpose complement" to the other dplyr functions. To me it seemed a natural choice for an anonymous function. However, I was surprised that do was so much slower, even when the non-dplyr rbind-ing part was skipped. Currently, the do documentation is rather sparse, so I wonder whether I am abusing the function, and whether there are more appropriate (undocumented?) ways to do it.
I got no hits on index/indices when I searched the dplyr help text or vignette. So now I wonder:
Are there other dplyr methods to replace parts of a variable using numeric indices which I have overlooked? Specifically, is the creation of an index column in combination with ifelse the way to go, or are there more direct a[i] <- b[i]-like alternatives?
Edit following a comment from @G.Grothendieck (Thanks!): added a replace alternative (a candidate for 'See also' in ?`[`).
fun_replace <- function(df){
  df %.%
    group_by(grp) %.%
    mutate(
      a = replace(a, id, b[id]))
}
fun_replace(df1)
microbenchmark(fun_do(df3),
               fun_do2(df3),
               fun_mut(df3),
               fun_mut_ifelse(df3),
               fun_replace(df3),
               times = 10)
# Unit: milliseconds
#                 expr        min         lq     median         uq        max neval
#          fun_do(df3) 685.154605 693.327160 706.055271 712.180410 851.757790    10
#         fun_do2(df3) 291.787455 294.047747 297.753888 299.624730 302.368554    10
#         fun_mut(df3)   5.736640   5.883753   6.206679   6.353222   7.381871    10
#  fun_mut_ifelse(df3)  24.321894  26.091049  29.361553  32.649924  52.981525    10
#     fun_replace(df3)   4.616757   4.748665   4.981689   5.279716   5.911503    10
The replace function is fastest, and certainly easier to use than fun_mut when there is more than one index.
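As a minimal illustration of why replace scales to several indices, here it is on plain vectors, outside any grouping. The vectors a and b and the index vector id below are made up for this sketch; note that replace returns a modified copy rather than changing a in place.

```r
# Toy vectors, chosen only for illustration
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

# Several positions at once
id <- c(1, 3)

# replace(x, list, values) returns a copy of x with x[list] set to values
a_new <- replace(a, id, b[id])

a_new  # positions 1 and 3 now hold the values from b
a      # the original vector is unchanged
```

Inside mutate the same call is evaluated once per group, so id refers to positions within each group.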
Edit 2: fun_do and fun_do2 no longer work in dplyr 0.2; they fail with Error: Results are not data frames at positions:
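For readers on a later dplyr, where %.% has been replaced by %>%, the following is a rough sketch of the same two ideas. It assumes a dplyr version that provides group_modify(), which is swapped in here for the old do(function...) call and is not part of the original benchmarks. It has not been timed against the functions above.

```r
library(dplyr)

df1 <- data.frame(grp = rep(1:2, each = 3),
                  a = 1:6,
                  b = rep(c(10, 20), each = 3))
id <- 2

# mutate() + replace(): same logic as fun_replace, written with %>%
res_replace <- df1 %>%
  group_by(grp) %>%
  mutate(a = replace(a, id, b[id])) %>%
  ungroup()

# group_modify(): applies a function to each group's rows,
# closest in spirit to the ddply / do versions.
# dat holds the group's rows (without grp), key holds the group key.
res_modify <- df1 %>%
  group_by(grp) %>%
  group_modify(function(dat, key) {
    dat$a[id] <- dat$b[id]
    dat
  }) %>%
  ungroup()

res_replace
res_modify
```

Both results should match the ddply output shown earlier. The mutate version is likely the better default, since it avoids calling an R function on a separate data frame for every group.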
Here's a much faster modify-in-place approach:
library(data.table)
# select rows we want, then assign b to a for those rows, in place
fun_dt = function(dt) dt[dt[, .I[id], by = grp]$V1, a := b]
# benchmark
df4 = data.frame(grp = rep(1:20000, each = 3),
                 a = rnorm(60000),
                 b = rnorm(60000))
dt4 = as.data.table(df4)
library(microbenchmark)
# using fastest function from OP
microbenchmark(fun_dt(dt4), fun_replace(df4), times = 10)
# Unit: milliseconds
#              expr      min        lq    median        uq       max neval
#       fun_dt(dt4) 15.62325  17.22828  18.42445  20.83768  21.25371    10
#  fun_replace(df4) 99.03505 107.31529 116.74830 188.89134 286.50199    10
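To make the one-liner in fun_dt easier to follow, here is a sketch that splits it into its two steps on the small example data. The names dt1, rows and dt_new are introduced only for this illustration, and column a is created as a double so that its type matches b.

```r
library(data.table)

dt1 <- data.table(grp = rep(1:2, each = 3),
                  a = as.numeric(1:6),   # double, so it matches the type of b
                  b = rep(c(10, 20), each = 3))
id <- 2

# Step 1: .I holds row numbers in the full table, so .I[id] picks the
# row number of the id-th row in each group. The unnamed result
# column is called V1.
rows <- dt1[, .I[id], by = grp]$V1
rows

# Step 2: assign b to a by reference, on those rows only.
# copy() is used here so that dt1 itself stays untouched for comparison;
# fun_dt above modifies its argument in place.
dt_new <- copy(dt1)[rows, a := b]
dt_new
```

Because := updates by reference, calling fun_dt(dt4) changes dt4 itself; wrap the table in copy() first if the original values are still needed.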