I know there are several similar questions around here, but none of them seems to address the precise issue I'm having.
set.seed(4)
df = data.frame(
Key = c("A", "B", "A", "D", "A"),
Val1 = rnorm(5),
Val2 = runif(5),
Val3 = 1:5
)
I want to zeroise values of the value columns for the rows where Key == "A"
The column names are referenced through a grep
:
cols = grep("Val", names(df), value = TRUE)
Normally to achieve what I want in this case I would use data.table
like this:
library(data.table)
df = as.data.table(df)
df[Key == "A", (cols) := 0]
And the desired output is like this:
Key Val1 Val2 Val3
1 A 0.000000 0.00000000 0
2 B -1.383814 0.55925762 2
3 A 0.000000 0.00000000 0
4 D 1.437151 0.05632773 4
5 A 0.000000 0.00000000 0
However this time I need to use dplyr
as I am working on a team project where everyone uses it. The data I just provided is illustrative and my real data is >5m rows with 16 value columns to be updated. The only solution I could come up with is using mutate_at
like this:
df %>% mutate_at(.vars = vars(cols), .funs = function(x) ifelse(df$Key == "A", 0, x))
However, this seems to be extremely slow on my real data. I was hoping to find a solution which is more elegant and, more importantly, faster.
I have tried many combinations using map
, unquoting using !!
, using get
and :=
(which annoyingly can get masked by the :=
in data.table) etc, but I think my understanding of how these work is simply not deep enough to construct a valid solution.
With this dplyr command,
df %>% mutate_at(.vars = vars(cols), .funs = function(x) ifelse(df$Key == "A", 0, x))
You are actually evaluating the statement df$Key == "A", n times, where n=the number of columns you have.
One work around is to pre-define the rows you want to change:
idx = which(DF$Key=="A")
DF %>% mutate_at(.vars = vars(cols), .funs = function(x){x[idx]=0;x})
A cleaner and better way, correctly pointed out by @IceCreamToucan (see comments below), is to use the function replace, while passing it the extra parameters:
DF %>% mutate_at(.vars = vars(cols), replace, DF$Key == 'A', 0)
We can put all these approaches to test, and I think dplyr and data.table are comparable.
#simulate data
set.seed(100)
Key = sample(LETTERS[1:3],1000000,replace=TRUE)
DF = as.data.frame(data.frame(Key,matrix(runif(1000000*10),nrow=1000000,ncol=10)))
DT = as.data.table(DF)
cols = grep("[35789]", names(DF), value = TRUE)
#long method
system.time(DF %>% mutate_at(.vars = vars(cols), .funs = function(x) ifelse(DF$Key == "A", 0, x)))
user system elapsed
0.121 0.035 0.156
#old base R way
system.time(DF[idx,cols] <- 0)
user system elapsed
0.085 0.021 0.106
#dplyr
# define function
func = function(){
idx = which(DF$Key=="A")
DF %>% mutate_at(.vars = vars(cols), .funs = function(x){x[idx]=0;x})
}
system.time(func())
user system elapsed
0.020 0.006 0.026
#data.table
system.time(DT[Key=="A", (cols) := 0])
user system elapsed
0.012 0.001 0.013
#replace with dplyr
system.time(DF %>% mutate_at(.vars = vars(cols), replace, DF$Key == 'A', 0))
user system elapsed
0.007 0.001 0.008
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With