I have a data.frame that has 100 variables. I want to get the sum of three variables only using mutate
(not summarise
).
If there is NA in any of the 3 variables, I still want to get the sum
. In order to do this using mutate
, I replaced all NA
values with 0
using ifelse
then I got the sum
.
library(dplyr)
df %>% mutate(mod_var1 = ifelse(is.na(var1), 0, var1),
mod_var2 = ifelse(is.na(var2), 0, var2),
mod_var3 = ifelse(is.na(var3), 0, var3),
sum = (mod_var1+mod_var2+mod_var3))
Is there any better (shorter) way to do this?
DATA
df <- read.table(text = c("
var1 var2 var3
4 5 NA
2 NA 3
1 2 4
NA 3 5
3 NA 2
1 1 5"), header =T)
mutate() adds new variables and preserves existing ones; transmute() adds new variables and drops existing ones. New variables overwrite existing variables of the same name.
To use mutate in R, all you need to do is call the function, specify the dataframe, and specify the name-value pair for the new variable you want to create.
In R programming, the mutate function is used to create a new variable from a data set. In order to use the function, we need to install the dplyr package, which is an add-on to R that includes a host of cool functions for selecting, filtering, grouping, and arranging data.
rowwise()
is my go-to function. It's like group_by()
but it treats each row as an individual group.
df %>% rowwise() %>% mutate(Sum = sum(c(var1, var2, var3), na.rm = TRUE))
We can use Reduce
with +
df %>%
mutate_each(funs(replace(., is.na(.), 0)), var1:var3) %>%
mutate(Sum = Reduce(`+`, .))
# var1 var2 var3 Sum
#1 4 5 0 9
#2 2 0 3 5
#3 1 2 4 7
#4 0 3 5 8
#5 3 0 2 5
#6 1 1 5 7
Or with rowSums
df %>%
mutate(Sum = rowSums(.[names(.)[1:3]], na.rm = TRUE))
# var1 var2 var3 Sum
#1 4 5 NA 9
#2 2 NA 3 5
#3 1 2 4 7
#4 NA 3 5 8
#5 3 NA 2 5
#6 1 1 5 7
set.seed(24)
df1 <- as.data.frame(matrix(sample(c(NA, 1:5), 1e6 *3, replace=TRUE),
dimnames = list(NULL, paste0("var", 1:3)), ncol=3))
system.time({
df1 %>% rowwise() %>% mutate(Sum = sum(c(var1, var2, var3), na.rm = TRUE))
})
# user system elapsed
# 21.50 0.03 21.66
system.time({
df1 %>%
mutate(rn = row_number()) %>%
gather(var, varNum, var1:var3) %>%
group_by(rn) %>%
mutate(sum = sum(varNum, na.rm = TRUE)) %>%
spread(var, varNum)})
# user system elapsed
# 5.96 0.39 6.37
system.time({
replace(df1, is.na(df1), 0) %>% mutate(sum = var1 + var2 + var3)
})
# user system elapsed
# 0.17 0.01 0.19
system.time({
df1 %>%
mutate_each(funs(replace(., is.na(.), 0)), var1:var3) %>%
mutate(Sum = Reduce(`+`, .))
})
# user system elapsed
# 0.10 0.02 0.11
system.time({
df1 %>%
mutate(Sum = rowSums(.[names(.)[1:3]], na.rm = TRUE))
})
# user system elapsed
# 0.04 0.00 0.03
Where better = tidyr
:
df %>%
mutate(rn = row_number()) %>%
gather(var, varNum, var1:var3) %>%
group_by(rn) %>%
mutate(sum = sum(varNum, na.rm = TRUE)) %>%
spread(var, varNum)
In case your dataset is poised to grow...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With