Can dplyr package be used for conditional mutating?

People also ask

What package do you need for mutate in R?

In R programming, the mutate function is used to create a new variable from a data set. In order to use the function, we need to install the dplyr package, which is an add-on to R that includes a host of cool functions for selecting, filtering, grouping, and arranging data.

What does mutate in dplyr do?

mutate() is a dplyr function that adds new variables and preserves existing ones. That's what the documentation says. So when you want to add new variables or change one already in the dataset, that's your good ally.

How do I create a mutate variable in R?

To use mutate in R, all you need to do is call the function, specify the dataframe, and specify the name-value pair for the new variable you want to create.

Use ifelse

df %>%
  mutate(g = ifelse(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,
               ifelse(a == 0 | a == 1 | a == 4 | a == 3 |  c == 4, 3, NA)))

Added - if_else: Note that in dplyr 0.5 there is an if_else function defined so an alternative would be to replace ifelse with if_else; however, note that since if_else is stricter than ifelse (both legs of the condition must have the same type) so the NA in that case would have to be replaced with NA_real_ .

df %>%
  mutate(g = if_else(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,
               if_else(a == 0 | a == 1 | a == 4 | a == 3 |  c == 4, 3, NA_real_)))

Added - case_when Since this question was posted dplyr has added case_when so another alternative would be:

df %>% mutate(g = case_when(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,
                            a == 0 | a == 1 | a == 4 | a == 3 |  c == 4 ~ 3,
                            TRUE ~ NA_real_))

Added - arithmetic/na_if If the values are numeric and the conditions (except for the default value of NA at the end) are mutually exclusive, as is the case in the question, then we can use an arithmetic expression such that each term is multiplied by the desired result using na_if at the end to replace 0 with NA.

df %>%
  mutate(g = 2 * (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)) +
             3 * (a == 0 | a == 1 | a == 4 | a == 3 |  c == 4),
         g = na_if(g, 0))

Since you ask for other better ways to handle the problem, here's another way using data.table:

require(data.table) ## 1.9.2+
setDT(df)
df[a %in% c(0,1,3,4) | c == 4, g := 3L]
df[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]

Note the order of conditional statements is reversed to get g correctly. There's no copy of g made, even during the second assignment - it's replaced in-place.

On larger data this would have better performance than using nested if-else, as it can evaluate both 'yes' and 'no' cases, and nesting can get harder to read/maintain IMHO.

Here's a benchmark on relatively bigger data:

# R version 3.1.0
require(data.table) ## 1.9.2
require(dplyr)
DT <- setDT(lapply(1:6, function(x) sample(7, 1e7, TRUE)))
setnames(DT, letters[1:6])
# > dim(DT) 
# [1] 10000000        6
DF <- as.data.frame(DT)

DT_fun <- function(DT) {
    DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
    DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
}

DPLYR_fun <- function(DF) {
    mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L, 
            ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}

BASE_fun <- function(DF) { # R v3.1.0
    transform(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L, 
            ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}

system.time(ans1 <- DT_fun(DT))
#   user  system elapsed 
#  2.659   0.420   3.107 

system.time(ans2 <- DPLYR_fun(DF))
#   user  system elapsed 
# 11.822   1.075  12.976 

system.time(ans3 <- BASE_fun(DF))
#   user  system elapsed 
# 11.676   1.530  13.319 

identical(as.data.frame(ans1), as.data.frame(ans2))
# [1] TRUE

identical(as.data.frame(ans1), as.data.frame(ans3))
# [1] TRUE

Not sure if this is an alternative you'd asked for, but I hope it helps.

dplyr now has a function case_when that offers a vectorised if. The syntax is a little strange compared to mosaic:::derivedFactor as you cannot access variables in the standard dplyr way, and need to declare the mode of NA, but it is considerably faster than mosaic:::derivedFactor.

df %>%
mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L, 
                     a %in% c(0,1,3,4) | c == 4 ~ 3L, 
                     TRUE~as.integer(NA)))

EDIT: If you're using dplyr::case_when() from before version 0.7.0 of the package, then you need to precede variable names with '.$' (e.g. write .$a == 1 inside case_when).

Benchmark: For the benchmark (reusing functions from Arun 's post) and reducing sample size:

require(data.table) 
require(mosaic) 
require(dplyr)
require(microbenchmark)

set.seed(42) # To recreate the dataframe
DT <- setDT(lapply(1:6, function(x) sample(7, 10000, TRUE)))
setnames(DT, letters[1:6])
DF <- as.data.frame(DT)

DPLYR_case_when <- function(DF) {
  DF %>%
  mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L, 
                       a %in% c(0,1,3,4) | c==4 ~ 3L, 
                       TRUE~as.integer(NA)))
}

DT_fun <- function(DT) {
  DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
  DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
}

DPLYR_fun <- function(DF) {
  mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L, 
                    ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}

mosa_fun <- function(DF) {
  mutate(DF, g = derivedFactor(
    "2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
    "3" = (a == 0 | a == 1 | a == 4 | a == 3 |  c == 4),
    .method = "first",
    .default = NA
  ))
}

perf_results <- microbenchmark(
  dt_fun <- DT_fun(copy(DT)),
  dplyr_ifelse <- DPLYR_fun(copy(DF)),
  dplyr_case_when <- DPLYR_case_when(copy(DF)),
  mosa <- mosa_fun(copy(DF)),
  times = 100L
)

This gives:

print(perf_results)
Unit: milliseconds
           expr        min         lq       mean     median         uq        max neval
         dt_fun   1.391402    1.560751   1.658337   1.651201   1.716851   2.383801   100
   dplyr_ifelse   1.172601    1.230351   1.331538   1.294851   1.390351   1.995701   100
dplyr_case_when   1.648201    1.768002   1.860968   1.844101   1.958801   2.207001   100
           mosa 255.591301  281.158350 291.391586 286.549802 292.101601 545.880702   100

The derivedFactor function from mosaic package seems to be designed to handle this. Using this example, it would look like:

library(dplyr)
library(mosaic)
df <- mutate(df, g = derivedFactor(
     "2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
     "3" = (a == 0 | a == 1 | a == 4 | a == 3 |  c == 4),
     .method = "first",
     .default = NA
     ))

(If you want the result to be numeric instead of a factor, you can wrap derivedFactor in an as.numeric call.)

derivedFactor can be used for an arbitrary number of conditionals, too.

case_when is now a pretty clean implementation of the SQL-style case when:

structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4, 
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4, 
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4, 
2, 2, 7, 5, 2)), .Names = c("a", "b", "c", "d", "e", "f"), row.names = c(NA, 
8L), class = "data.frame") -> df


df %>% 
    mutate( g = case_when(
                a == 2 | a == 5 | a == 7 | (a == 1 & b == 4 )     ~   2,
                a == 0 | a == 1 | a == 4 |  a == 3 | c == 4       ~   3
))

Using dplyr 0.7.4

The manual: http://dplyr.tidyverse.org/reference/case_when.html

Related questions
                            
                                Change size of axes title and labels in ggplot2
                            
                                Fixing a multiple warning "unknown column"
                            
                                Error: could not find function ... in R
                            
                                Relative frequencies / proportions with dplyr
                            
                                How can I handle R CMD check "no visible binding for global variable" notes when my ggplot2 syntax is sensible?
                            
                                How to prevent ifelse() from turning Date objects into numeric objects
                            
                                How to suppress warnings globally in an R Script
                            
                                Reasons for using the set.seed function
                            
                                How to use R's ellipsis feature when writing your own function?
                            
                                How to assign colors to categorical variables in ggplot2 that have stable mapping?
                            
                                Mean per group in a data.frame [duplicate]
                            
                                How to select a CRAN mirror in R
                            
                                "Correct" way to specifiy optional arguments in R functions
                            
                                Load multiple packages at once
                            
                                Speed up the loop operation in R
                            
                                Numbering rows within groups in a data frame
                            
                                Why use purrr::map instead of lapply?
                            
                                remove kernel on jupyter notebook
                            
                                Plot a legend outside of the plotting area in base graphics?
                            
                                Reshaping data.frame from wide to long format

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Can dplyr package be used for conditional mutating?

Tags:

r

if-statement

dplyr

case-when

People also ask

Recent Activity

Donate For Us