I have a relatively straightforward question that I've been unable to find a solution for.
Suppose I have the following dataset:
ID | dummy_var | String1 | String2 | String3 |
---|---|---|---|---|
1 | 0 | Tom | NA | NA |
1 | 1 | NA | Jo | NA |
2 | 0 | Tom | NA | NA |
2 | 1 | NA | Jo | NA |
2 | 0 | NA | NA | Bob |
3 | 0 | Steve | NA | NA |
3 | 0 | NA | Timmy | NA |
4 | 0 | Alex | NA | NA |
I want to use group by and summarize to get the following:
ID | dummy_var | String1 | String2 | String3 |
---|---|---|---|---|
1 | 1 | Tom | Jo | NA |
2 | 1 | Tom | Jo | Bob |
3 | 0 | Steve | Timmy | NA |
4 | 0 | Alex | NA | NA |
I've had no trouble with the "dummy_var", using a variation of dummy_var = max(dummy_var) within a summarize function, but I cannot seem to find anything on how to get the strings as I want.
I have tried variations like:
group_by(ID) %>%
summarize(
String1 = str_c(String1)
)
or
group_by(ID) %>%
summarize(
String1 = case_when(
length(str_c(String1)) > 0 ~ str_c(String1),
str_c(String1) == rep(NA, length(str_c(String1))) ~ NA
)
)
With the first attempt, the rows do not actually change. Numeric operations such as max(dummy_var) yield 0 or 1 as intended for each group, but the string variables are not summarized: after ungrouping and printing the data frame, there are still multiple rows per ID, as if the string columns had never been summarized in the first place.
With the second approach, the function always fails when all values within a group are NA, with an error saying that "String(i) must be of length greater than 0" or some variation of that.
I noticed that if I try the following
group_by(ID) %>%
summarize(
String1 = str_replace_na(String1)
)
The output is the same as the first code block, as if nothing happened at all.
Other facts about my data: String1 will always have, per group, at least one non-NA value. For String2 and String3, many groups contain only NA, and I want the collapsed row to read NA as well, as in my example. Furthermore, in no case does any group_by() group have a column with more than one row containing something other than NA; i.e., within groups, each row has only one of String1/String2/String3 as something other than NA, or they may all be NA (such as in ID=2 in my example). All other columns that contain int or double values summarize with no problem. It is just the strings. Using paste0() in lieu of str_c() also makes no difference.
Can anyone give me advice? I couldn't find any example like this online where NAs are within columns within groups, and also where within groups they sometimes comprise all the values within columns.
My only alternative would be to use replace_na() on all NAs, concatenate them with some filler text, then go back and for each value pluck them out with stringr or something. It works, but I know there must be an elegant approach!
EDIT: It turns out that if I use str_replace_na() instead of str_c(), I end up getting, for instance,
ID | dummy_var | String1 | String2 | String3 |
---|---|---|---|---|
1 | 1 | Tom | "NA" | "NA" |
1 | 1 | "NA" | "Jo" | "NA" |
2 | 1 | Tom | "NA" | "NA" |
2 | 1 | "NA" | "Jo" | "NA" |
2 | 1 | "NA" | "NA" | Bob |
That is, the values are replaced with the string "NA" rather than an actual NA. This is surprising given that the following is true:
str_replace_na("Something",NA)
> "Something"
str_c("Something",NA)
> NA
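A likely explanation for the attempts above, offered as a sketch rather than a definitive diagnosis: str_c() and str_replace_na() are vectorised, so given a single column they return one element per input element instead of a single collapsed value. A summarize() call that receives several values per group then keeps several rows per group. The vector `x` below is made-up illustration data standing in for one group's String1 values.

```r
library(stringr)

# Hypothetical stand-in for String1 within one group
x <- c("Tom", NA, NA)

# Both functions work element by element, so the length stays 3
elementwise <- str_c(x)          # "Tom" NA NA
replaced <- str_replace_na(x)    # "Tom" "NA" "NA" (NA becomes the string "NA")

# Dropping the NAs first and taking the first survivor gives one value;
# if every element is NA, indexing the empty result with [1] gives NA
collapsed <- x[!is.na(x)][1]
all_missing <- c(NA_character_, NA_character_)
collapsed_missing <- all_missing[!is.na(all_missing)][1]
```

Here `collapsed` is a single string and `collapsed_missing` is a single NA, which is the shape summarize() needs for one row per group.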
You could use tidyr's fill function:
library(tidyr)
library(dplyr)
df %>%
group_by(ID) %>%
fill(starts_with("String"), .direction="downup") %>%
filter(dummy_var == max(dummy_var)) %>%
distinct() %>%
ungroup()
which returns
# A tibble: 4 x 5
ID dummy_var String1 String2 String3
<dbl> <dbl> <chr> <chr> <chr>
1 1 1 Tom Jo NA
2 2 1 Tom Jo Bob
3 3 0 Steve Timmy NA
4 4 0 Alex NA NA
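If you would rather stay with group_by() and summarize() as in the question, here is a minimal sketch of an alternative. It assumes, as the question states, that each string column has at most one non-NA value per group. The helper `first_non_na` is a name made up for this example, not a dplyr function, and the data is rebuilt with tibble() so the block runs on its own.

```r
library(dplyr)

# Same values as the example data in the question
df <- tibble(
  ID = c(1, 1, 2, 2, 2, 3, 3, 4),
  dummy_var = c(0, 1, 0, 1, 0, 0, 0, 0),
  String1 = c("Tom", NA, "Tom", NA, NA, "Steve", NA, "Alex"),
  String2 = c(NA, "Jo", NA, "Jo", NA, NA, "Timmy", NA),
  String3 = c(NA, NA, NA, NA, "Bob", NA, NA, NA)
)

# Hypothetical helper: first non-missing value, or NA if all are missing
first_non_na <- function(x) x[!is.na(x)][1]

result <- df %>%
  group_by(ID) %>%
  summarise(
    dummy_var = max(dummy_var),
    # apply the helper to String1, String2 and String3
    across(starts_with("String"), first_non_na),
    .groups = "drop"
  )
```

Because the helper always returns exactly one value, each group collapses to a single row, and groups where a column is entirely NA keep a real NA rather than the string "NA".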
##Data
df <- structure(list(ID = c(1, 1, 2, 2, 2, 3, 3, 4), dummy_var = c(0,
1, 0, 1, 0, 0, 0, 0), String1 = c("Tom", NA, "Tom", NA, NA, "Steve",
NA, "Alex"), String2 = c(NA, "Jo", NA, "Jo", NA, NA, "Timmy",
NA), String3 = c(NA, NA, NA, NA, "Bob", NA, NA, NA)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -8L), spec = structure(list(
cols = list(ID = structure(list(), class = c("collector_double",
"collector")), dummy_var = structure(list(), class = c("collector_double",
"collector")), String1 = structure(list(), class = c("collector_character",
"collector")), String2 = structure(list(), class = c("collector_character",
"collector")), String3 = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))