Mutate with dplyr using multiple conditions

Tags: r, dplyr

I have a data frame (df), shown below, and I want to use dplyr to add a column, result, that takes the value 1 when z == "gone" and x is the maximum value within its group in y, and 0 otherwise.

   y  x    z
1  a  3 gone
2  a  5 gone
3  a  8 gone
4  a  9 gone
5  a 10 gone
6  b  1     
7  b  2     
8  b  4     
9  b  6     
10 b  7     
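
For anyone who wants to reproduce the example, the data frame above could be rebuilt roughly like this. This is only a sketch: the question does not say whether the blank z entries are empty strings or NA, so empty strings are assumed here.

```r
# Rebuild the example data from the question.
# Assumption: the blank z values are empty strings (""), not NA.
df <- data.frame(
  y = rep(c("a", "b"), each = 5),
  x = c(3, 5, 8, 9, 10, 1, 2, 4, 6, 7),
  z = rep(c("gone", ""), each = 5),
  stringsAsFactors = FALSE
)
```

If the blanks were NA instead, z == "gone" would evaluate to NA for those rows, and the row holding the group maximum would end up with NA rather than 0, so the assumption matters.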

If I simply select the row with the maximum x for each group, the code is:

df %>%
  group_by(y) %>%
  slice(which.max(x))

which will return:

  y  x    z
1 a 10 gone
2 b  7     

This is not what I want. I need to compare each x against the maximum of x within its group in y while also checking whether z == "gone". The result should be 1 if both conditions are TRUE and 0 otherwise. The desired output looks like this:

   y  x    z result
1  a  3 gone      0
2  a  5 gone      0
3  a  8 gone      0
4  a  9 gone      0
5  a 10 gone      1
6  b  1           0
7  b  2           0
8  b  4           0
9  b  6           0
10 b  7           0

I'm assuming I would use a conditional statement within mutate() but I cannot seem to find an example. Please advise.

asked Oct 08 '15 03:10 by Ryan Erwin


2 Answers

With dplyr you can use:

library(dplyr)

df %>%
  group_by(y) %>%
  mutate(result = +(x == max(x) & z == 'gone'))

The +(..) notation is shorthand for as.integer(): the unary plus coerces the logical output to 1s and 0s. Some people dislike it, so it is a trade-off between shorter code and readability. Whether it brings any efficiency gain depends on the circumstances.
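
As a small illustration of that coercion, here is a sketch on a made-up logical vector (flags is a hypothetical name, not part of the question). It also includes the ifelse() spelling, which is probably closer to the "conditional statement" the question had in mind.

```r
# Hypothetical logical vector standing in for x == max(x) & z == "gone".
flags <- c(TRUE, FALSE, TRUE)

# Unary plus coerces logical values to integers.
plus_version <- +flags

# as.integer() does the same thing more explicitly.
explicit_version <- as.integer(flags)

# ifelse() spells the condition out as "1 if TRUE, otherwise 0".
conditional_version <- ifelse(flags, 1L, 0L)

# All three should agree.
identical(plus_version, explicit_version)
identical(plus_version, conditional_version)
```

Any of the three forms should be usable inside mutate(); the choice is mostly about which one reads best to you.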

Also, to appreciate what data.table and dplyr have done for data manipulation in R, let's do the same thing the old-fashioned "split-apply-combine" way:

# Split the data frame by group
split.df <- split(df, df$y)

# Apply the required function to each group
lst <- lapply(split.df, function(dfx) {
  dfx$result <- +(dfx$x == max(dfx$x) & dfx$z == "gone")
  dfx
})

# Combine the results into a new data frame
newdf <- do.call(rbind, lst)

answered Sep 28 '22 01:09 by Pierre L


We can do this with data.table. We convert the data.frame to a data.table with setDT(df). Then, grouped by y, we build the logical condition that x equals the group maximum and z is 'gone', coerce it to integer with as.integer, and assign the output by reference (:=) to the new column result.

library(data.table)
setDT(df)[, result := as.integer(x==max(x) & z=='gone') , by = y]
df
#    y  x    z result
# 1: a  3 gone      0
# 2: a  5 gone      0
# 3: a  8 gone      0
# 4: a  9 gone      0
# 5: a 10 gone      1
# 6: b  1           0
# 7: b  2           0
# 8: b  4           0
# 9: b  6           0
#10: b  7           0

Or we can use ave from base R, which returns the group maximum repeated for every row of its group:

df$result <- with(df, +(ave(x, y, FUN=max)==x & z=='gone' ))
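
To see what ave contributes, here is a self-contained sketch on plain vectors that mirror the question's columns (the blank z values are assumed to be empty strings, and group_max is a name introduced only for illustration):

```r
# Vectors mirroring the question's columns.
# Assumption: blank z values are empty strings.
x <- c(3, 5, 8, 9, 10, 1, 2, 4, 6, 7)
y <- rep(c("a", "b"), each = 5)
z <- rep(c("gone", ""), each = 5)

# ave() applies FUN within each level of y and returns a vector as long as x,
# so every row carries the maximum of its own group.
group_max <- ave(x, y, FUN = max)

# A row is flagged when it holds its group's maximum and z is "gone".
result <- +(group_max == x & z == "gone")
```

Because group_max lines up row by row with x, no explicit grouping step is needed before the comparison.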

answered Sep 28 '22 01:09 by akrun