Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

grouped case_when returns duplicate rows

Tags:

r

dplyr

I am grouping by an identifier (the identifier may have multiple rows) and then attempting to make a unique row per identifier by using a group_by + summarize with a case_when. The idea is if there are 2 or more rows then return a single preset value otherwise return the original value. The problem is that case_when returns duplicate rows with the preset value. How do I make sure the output is unique? I could use a distinct but my expectation would be that the value is already unique. dplyr v. 1.0.7

MWE

library(tidyverse)

dat <- tibble(
    id = c(1, 1, 2, 3, 4, 4, 5),
    fav_color = c('red', 'blue', 'red', 'green', 'white', 'black', 'pink')
) 


dat %>%
    group_by(id) %>%
    summarize(
        fav_color = case_when(
            n() > 1 ~ 'Multiple Colors',
            n() == 1 ~ fav_color
        )
    )

Returns

# A tibble: 7 x 2
# Groups:   id [5]
     id fav_color      
  <dbl> <chr>          
1     1 Multiple Colors
2     1 Multiple Colors
3     2 red            
4     3 green          
5     4 Multiple Colors
6     4 Multiple Colors
7     5 pink    

What I want ifelse gives:

dat %>%
    group_by(id) %>%
    summarize(
        fav_color = ifelse(n() > 1, 'Multiple Colors', fav_color)
    )
# A tibble: 5 x 2
     id fav_color      
  <dbl> <chr>          
1     1 Multiple Colors
2     2 red            
3     3 green          
4     4 Multiple Colors
5     5 pink 

Note Using a character output for the final return results in the correct number of rows:

dat %>%
    group_by(id) %>%
    summarize(
        fav_color = case_when(
            n() > 1 ~ 'Multiple Colors',
            n() == 1 ~ 'Single Color'
        )
    )
# A tibble: 5 x 2
     id fav_color      
  <dbl> <chr>          
1     1 Multiple Colors
2     2 Single Color   
3     3 Single Color   
4     4 Multiple Colors
5     5 Single Color 
like image 618
Tyler Rinker Avatar asked Sep 19 '25 21:09

Tyler Rinker


2 Answers

The issue with case_when is that it evaluates all the outputs for all the values even when the condition is not satisfied.

Consider this example -

library(dplyr)

a <- c(1, 2, 3, 4)
case_when(a > 0 ~ a * 2, 
          a < 0 ~ 'a' * 2)

Error in "a" * 2 : non-numeric argument to binary operator

So in your case n() == 1 ~ fav_color is executed for every group. Since the groups which have more than one row will have more than one value of fav_color it maintains the number of rows in the output as the original dataframe. The solution is to use a summarising function as @theQi suggested that will always return an output of length 1.

dat %>%
  group_by(id) %>%
  summarize(
    fav_color1 = case_when(
      n() > 1 ~ 'Multiple Colors',
      n() == 1 ~ fav_color[1]
    )
  )

# A tibble: 5 x 2
#     id fav_color1     
#   <dbl> <chr>          
#1     1 Multiple Colors
#2     2 red            
#3     3 green          
#4     4 Multiple Colors
#5     5 pink      

Not relevant to the question but something worth knowing is fcase from data.table is different. It evaluates the output only when it is needed.

a <- c(1, 2, 3, 4)

data.table::fcase(a > 0, a * 2, 
                  a < 0, 'a' * 2)


#[1] 2 4 6 8
like image 138
Ronak Shah Avatar answered Sep 22 '25 13:09

Ronak Shah


case_when is not designed to deal with this, according to a Tidyverse developer (Github issue #5730).

like image 24
Dan Avatar answered Sep 22 '25 13:09

Dan