Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Multiple condition if-else using dplyr, custom function, or purr

Tags:

r

dplyr

purrr

I have a data frame that has similar structure to the following:

set.seed(123)  
df<-data_frame(SectionName = rep(letters[1:2], 50),
               TimeSpentSeconds = sample(0:360, 100, replace = TRUE),
               Correct = sample(0:1, 100, replace = TRUE))

I want summarise this data frame by taking all values of TimeSpentSeconds that fall into certain ranges (less than 30, between 30-60, between 60-90, ..., greater than 180), label the times as those ranges, group them by SectionName, and find the sum of the Correct column so that the resulting data frame looks (something) like this:

    TimeGroup             SectionName Correct
   <fct>                 <chr>         <int>
 1 LessThan30Secs        a                 2
 2 LessThan30Secs        b                 3
 3 30-60 Seconds         a                 4
 4 30-60 Seconds         b                 3
 5 60-90 Seconds         a                 2
 6 60-90 Seconds         b                 3
 7 90-120 Seconds        a                 4
 8 90-120 Seconds        b                 0
 9 120-150 Seconds       a                 4
10 120-150 Seconds       b                 0
11 150-180 Seconds       a                 1
12 150-180 Seconds       b                 2
13 GreaterThan180Seconds a                11
14 GreaterThan180Seconds b                11

I was able to successfully do this with the following if-else code where I mutated all of the times into a new column with the appropriate label, grouped, and summarised:

x <- c("LessThan30Secs", "30-60 Seconds", "60-90 Seconds","90-120 Seconds", 
           "120-150 Seconds", "150-180 Seconds", "GreaterThan180Seconds") 

df %>% 
mutate(TimeGroup = if_else(TimeSpentSeconds >= 0 & TimeSpentSeconds <= 30, "LessThan30Secs",
                if_else(TimeSpentSeconds > 30 & TimeSpentSeconds <= 60, "30-60 Seconds",
                if_else(TimeSpentSeconds > 60 & TimeSpentSeconds <= 90, "60-90 Seconds",
                if_else(TimeSpentSeconds > 90 & TimeSpentSeconds <= 120, "90-120 Seconds",
                if_else(TimeSpentSeconds > 120 & TimeSpentSeconds <= 150, "120-150 Seconds", 
                if_else(TimeSpentSeconds > 150 & TimeSpentSeconds <= 180, "150-180 Seconds",
                if_else(TimeSpentSeconds > 180, "GreaterThan180Seconds", "")))))))) %>%
    mutate(TimeGroup = factor(TimeGroup, levels = x)) %>%
    arrange(TimeGroup) %>%
    group_by(TimeGroup, SectionName) %>%
    summarise(Correct = sum(Correct))

But, there just has to be a better way to do this. I considered writing a function, but didn't get very far as I'm not great at function writing.

Does anyone have any ideas on a more elegant way to accomplish this same output through a dplyr method I didn't think of, writing a custom function maybe utilizing the purrr package at some point, or some other r function?

like image 856
KyleF Avatar asked Aug 26 '18 17:08

KyleF


2 Answers

case_when() will do what you want. Its a tidy alternative to nested ifelse() statements.

library(dplyr)

mutate(df,
       TimeGroup = case_when(
         TimeSpentSeconds <= 30 ~ "30 Seconds or less",
         TimeSpentSeconds <= 60 ~ "31-60 Seconds",
         TimeSpentSeconds <= 90 ~ "61-90 Seconds",
         TimeSpentSeconds <= 120 ~ "91-120 Seconds",
         TimeSpentSeconds <= 150 ~ "121-150 Seconds", 
         TimeSpentSeconds <= 180 ~ "151-180 Seconds",
         TimeSpentSeconds > 180 ~ "Greater Than 180 Seconds",
         TRUE ~ NA_character_)
)

The last argument is a catch all for records that don't fit any of the criteria, such as if time is somehow less than 0 seconds.

like image 76
Paul Avatar answered Nov 27 '22 15:11

Paul


We can do this easily with cut (or findInterval) instead of multiple nested ifelse statements

lbls <- c('LessThan30secs', '30-60 Seconds', '60-90 Seconds', 
    '90-120 Seconds', '120-150 Seconds', '150-180 Seconds', 'GreaterThan180Seconds')
df %>% 
    group_by(TimeGroup = cut(TimeSpentSeconds, 
                 breaks = c(seq(0, 180, by = 30), Inf), labels = lbls), 
            SectionName) %>%
   summarise(Correct = sum(Correct)) %>%
   na.omit
like image 24
akrun Avatar answered Nov 27 '22 14:11

akrun