
dplyr: group_by and summarize to collapse (via concatenation) columns of strings that contain NA

Tags: r, dplyr, summarize

I have a relatively straightforward question that I've been unable to find a solution for.

Suppose I have the following dataset:

ID dummy_var String1 String2 String3
1 0 Tom NA NA
1 1 NA Jo NA
2 0 Tom NA NA
2 1 NA Jo NA
2 0 NA NA Bob
3 0 Steve NA NA
3 0 NA Timmy NA
4 0 Alex NA NA

I want to use group by and summarize to get the following:

ID dummy_var String1 String2 String3
1 1 Tom Jo NA
2 1 Tom Jo Bob
3 0 Steve Timmy NA
4 0 Alex NA NA

I've had no trouble with the "dummy_var", using a variation of dummy_var = max(dummy_var) within a summarize function, but I cannot seem to find anything on how to get the strings as I want.

I have tried variations like:

group_by(ID) %>%
summarize(
String1 = str_c(String1)
)

or

group_by(ID) %>%
summarize(
String1 = case_when(
     length(str_c(String1)) > 0 ~ str_c(String1),
     str_c(String1) == rep(NA, length(str_c(String1))) ~ NA
     )
)

With the first attempt, the rows do not actually change. Numeric operations such as max(dummy_var) yield 0 or 1 as intended for each row within the group, but the string variables are not summarized: after ungrouping and printing the data frame, there are still multiple rows per ID, as if the string columns had never been summarized in the first place.

With the second approach, the function always fails whenever all values in a group are NA, with an error saying that "String(i) must be of length greater than 0" or some variation of that.

I noticed that if I try the following

group_by(ID) %>%
summarize(
String1 = str_replace_na(String1)
)

The output is the same as the first code block, as if nothing happened at all.
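
A likely reason the rows never collapse: both str_c() (without a collapse argument) and str_replace_na() are vectorized, so they return one element per input element rather than a single value. The sketch below, which uses a made-up two-row group, only checks the lengths involved; it assumes stringr is installed.

```r
library(stringr)

# A made-up group of two rows, like ID = 1 in the example
x <- c("Tom", NA)

# Called on a single vector, str_c() has nothing to join it with,
# so it returns a vector of the same length as its input
len_str_c <- length(str_c(x))

# str_replace_na() is also element-wise: one output per input
len_replace_na <- length(str_replace_na(x))

# Both results have two elements, so summarize() is not handed
# the single value per group that it needs to collapse the rows
print(c(len_str_c, len_replace_na))
```

Because each expression hands back as many values as the group has rows, summarize() has nothing to reduce, which would explain why the output still shows several rows per ID.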

Other facts about my data: String1 will always have, per group, at least one non-NA value. For String2 and String3, many groups contain only NA, and I want the collapsed row to read NA as well, as in my example. Furthermore, no group_by() group ever has a column with more than one non-NA row; i.e., within groups, each row has only one of String1/2/3 as something other than NA, or they may all be NA (such as in ID=2 in my example). All other columns, which contain int or double values, summarize with no problem. It is just the strings. Using paste0() in lieu of str_c() also makes no difference.

Can anyone give me advice? I couldn't find any example like this online where NAs are within columns within groups, and also where within groups they sometimes comprise all the values within columns.

My only alternative would be to use replace_na() on all NAs, concatenate them with some filler text, then go back and for each value pluck them out with stringr or something. It works, but I know there must be an elegant approach!

EDIT: It turns out, if I use str_replace_na() instead of str_c(), you end up getting, for instance,

ID dummy_var String1 String2 String3
1 1 Tom "NA" "NA"
1 1 "NA" "Jo" "NA"
2 1 Tom "NA" "NA"
2 1 "NA" "Jo" "NA"
2 1 "NA" "NA" Bob

That is, the values are replaced with the string "NA" rather than an actual NA. This is surprising given that the following is true:

str_replace_na("Something",NA)
> "Something"
str_c("Something",NA)
> NA
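
This difference is less surprising once the arguments are named: the second argument of str_replace_na() is replacement (the text substituted for missing values), not a second string to join, whereas str_c() joins its arguments and propagates NA. A small sketch of that reading, with illustrative inputs that are not from the original data:

```r
library(stringr)

# With the default replacement, a missing value becomes the literal string "NA"
replaced <- str_replace_na(c("Tom", NA))

# The second argument only controls what missing values are turned into
custom <- str_replace_na(c("Tom", NA), replacement = "missing")

# If the input has no missing values, the replacement is never used
untouched <- str_replace_na("Something", replacement = "unused")

# str_c() joins its arguments element-wise; a missing piece makes the result NA
joined <- str_c("Something", NA_character_)

print(replaced)
print(custom)
print(untouched)
print(joined)
```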
Asked Dec 31 '22 13:12 by econometrica_33


1 Answer

You could use tidyr's fill() function:

library(tidyr)
library(dplyr)

df %>% 
  group_by(ID) %>% 
  fill(starts_with("String"), .direction="downup") %>% 
  filter(dummy_var == max(dummy_var)) %>% 
  distinct() %>% 
  ungroup()

which returns

# A tibble: 4 x 5
     ID dummy_var String1 String2 String3
  <dbl>     <dbl> <chr>   <chr>   <chr>  
1     1         1 Tom     Jo      NA     
2     2         1 Tom     Jo      Bob    
3     3         0 Steve   Timmy   NA     
4     4         0 Alex    NA      NA   
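
If you would rather stay inside summarize(), as the question asks, one possible sketch is to pick the first non-NA value per group. This is a different technique from fill(), and it relies on the question's statement that each column has at most one non-NA value per group. The helper first_non_na is a hypothetical name, not part of dplyr; it uses the base R behavior that indexing an empty vector with [1] yields NA.

```r
library(dplyr)

# Same values as the example data, rebuilt here so the sketch is self-contained
df <- data.frame(
  ID        = c(1, 1, 2, 2, 2, 3, 3, 4),
  dummy_var = c(0, 1, 0, 1, 0, 0, 0, 0),
  String1   = c("Tom", NA, "Tom", NA, NA, "Steve", NA, "Alex"),
  String2   = c(NA, "Jo", NA, "Jo", NA, NA, "Timmy", NA),
  String3   = c(NA, NA, NA, NA, "Bob", NA, NA, NA)
)

# Hypothetical helper (not a dplyr function): first non-missing value,
# or NA when every value in the group is missing
first_non_na <- function(x) x[!is.na(x)][1]

collapsed <- df %>%
  group_by(ID) %>%
  summarize(
    dummy_var = max(dummy_var),
    across(starts_with("String"), first_non_na)
  )

print(collapsed)
```

Each expression returns exactly one value per group, so summarize() produces one row per ID, and groups where a string column is entirely NA keep a real NA rather than the text "NA".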

##Data

df <- structure(list(ID = c(1, 1, 2, 2, 2, 3, 3, 4), dummy_var = c(0, 
1, 0, 1, 0, 0, 0, 0), String1 = c("Tom", NA, "Tom", NA, NA, "Steve", 
NA, "Alex"), String2 = c(NA, "Jo", NA, "Jo", NA, NA, "Timmy", 
NA), String3 = c(NA, NA, NA, NA, "Bob", NA, NA, NA)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -8L), spec = structure(list(
    cols = list(ID = structure(list(), class = c("collector_double", 
    "collector")), dummy_var = structure(list(), class = c("collector_double", 
    "collector")), String1 = structure(list(), class = c("collector_character", 
    "collector")), String2 = structure(list(), class = c("collector_character", 
    "collector")), String3 = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1L), class = "col_spec"))
Answered Jan 13 '23 12:01 by Martin Gal