I have a dataset which has some missing values which I want to fill with other members of the same group. However, in some cases there is more than one value for each group and in these cases I want all the rows in each group to be replicated to have one row containing each value.
Sample data:
ID group value
1 1 A blue
2 2 A <NA>
3 3 A <NA>
4 4 B green
5 4 B red
6 5 B <NA>
7 6 B <NA>
8 7 C blue
9 8 C green
10 9 C NA
What I want to end up with is:
ID group value
1 1 A blue
2 2 A blue
3 3 A blue
4 4 B green
5 4 B red
6 5 B green
7 5 B red
8 6 B green
9 6 B red
10 7 C blue
11 7 C green
12 8 C blue
13 8 C green
14 9 C blue
15 9 C green
I have some cases where the group contains one ID which has two values (like group B) and others were there is more than one ID in the group, each with a different value (like C). In any case, I want a table where each member of the group has every value present in that group. I've found some answers dealing with simple cases like group A but none that have more than one value per group.
==== EDIT ====
My actual dataset is much bigger which has caused some additional problems. An updated sample table is below:
ID group value specific_value dataversion
1 A blue sky_blue version1
2 A <NA> <NA> version2
3 A <NA> <NA> version1
4 B green forest_green version1
4 B red scarlet version1
5 B <NA> <NA> version2
6 B <NA> <NA> <NA>
7 C blue royal_blue version2
8 C green lime_green version1
9 C <NA> <NA> version1
For each group I want to have a row with each set of value + specific_value from that group (but I wouldn't want a row with eg. blue and lime_green) for each member of the group. I want all the values for the other columns (ID, group, and dataversion) to be left as-is (including if eg. dataversion is NA).
Expected output:
ID group value specific_value dataversion
1 A blue sky_blue version1
2 A blue sky_blue version2
3 A blue sky_blue version1
4 B green forest_green version1
4 B red scarlet version1
5 B green forest_green version2
5 B red scarlet version2
6 B green forest_green <NA>
6 B red scarlet <NA>
7 C blue royal_blue version2
7 C green lime_green version2
8 C blue royal_blue version1
8 C green lime_green version1
9 C blue royal_blue version1
9 C green lime_green version1
Ie. each combination of ID, group, and dataversion in the table is the same as the original table but there is now a row for each combination of value and specific_value for each group. Note in my actual table I have ~50 columns of data (1 grouping column, ~6 are the equivalent to value/specific value here and the rest are treated like ID/dataversion) so I'd prefer not to have to type every column name.
We may need complete here. After grouping by 'group', use complete to get the combinations of unique non-NA 'value' for each 'group' and 'ID'
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
group_by(group) %>%
complete(ID, value = unique(value[!is.na(value)])) %>%
na.omit %>%
select(names(df1))
# A tibble: 15 x 3
# Groups: group [3]
# ID group value
# <int> <chr> <chr>
# 1 1 A blue
# 2 2 A blue
# 3 3 A blue
# 4 4 B green
# 5 4 B red
# 6 5 B green
# 7 5 B red
# 8 6 B green
# 9 6 B red
#10 7 C blue
#11 7 C green
#12 8 C blue
#13 8 C green
#14 9 C blue
#15 9 C green
with the new dataset, we can do
df2 %>%
group_by(group) %>%
mutate(valnew = str_c(value, specific_value, sep=":")) %>%
select(-value, -specific_value, -dataversion) %>%
complete(ID, valnew = unique(valnew[!is.na(valnew)])) %>%
filter(!is.na(valnew)) %>%
separate(valnew, into = c('value', 'specific_value'), sep=":") %>%
mutate(rn = row_number()) %>%
left_join(df2 %>%
select(ID, dataversion)) %>%
filter(!duplicated(rn)) %>%
select(names(df2))
# A tibble: 15 x 5
# Groups: group [3]
# ID group value specific_value dataversion
# <int> <chr> <chr> <chr> <chr>
# 1 1 A blue sky_blue version1
# 2 2 A blue sky_blue version2
# 3 3 A blue sky_blue version1
# 4 4 B green forest_green version1
# 5 4 B red scarlet version1
# 6 5 B green forest_green version2
# 7 5 B red scarlet version2
# 8 6 B green forest_green <NA>
# 9 6 B red scarlet <NA>
#10 7 C blue royal_blue version2
#11 7 C green lime_green version2
#12 8 C blue royal_blue version1
#13 8 C green lime_green version1
#14 9 C blue royal_blue version1
#15 9 C green lime_green version1
df1 <- structure(list(ID = c(1L, 2L, 3L, 4L, 4L, 5L, 6L, 7L, 8L, 9L),
group = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "C"
), value = c("blue", NA, NA, "green", "red", NA, NA, "blue",
"green", NA)), row.names = c("1", "2", "3", "4", "5", "6",
"7", "8", "9", "10"), class = "data.frame")
df2 <- structure(list(ID = c(1L, 2L, 3L, 4L, 4L, 5L, 6L, 7L, 8L, 9L),
group = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "C"
), value = c("blue", NA, NA, "green", "red", NA, NA, "blue",
"green", NA), specific_value = c("sky_blue", NA, NA, "forest_green",
"scarlet", NA, NA, "royal_blue", "lime_green", NA), dataversion = c("version1",
"version2", "version1", "version1", "version1", "version2",
NA, "version2", "version1", "version1")), class = "data.frame",
row.names = c(NA,
-10L))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With