Objective: Fill in implicit missing values for each ID and each group over the entire time span of the data, including groups that are entirely missing from the dataset.
Background: Each ID has several groups, and each group should have a value for every day. However, some groups are missing certain dates (for example, group b of ID 1 has no row for 2019-01-01), and some groups are entirely missing from the dataset (for example, group b of ID 2).
# A tibble: 8 x 4
ID group date value
<dbl> <chr> <chr> <dbl>
1 1 a 2019-01-01 1
2 1 a 2019-01-02 3
3 1 a 2019-01-03 4
4 1 b 2019-01-02 4
5 1 b 2019-01-03 5
6 2 a 2019-01-01 8
7 2 a 2019-01-02 9
8 2 a 2019-01-03 1
Code to generate the data frame:
library(tibble)
df <- tibble(ID = c(1, 1, 1, 1, 1, 2, 2, 2),
group = c('a', 'a', 'a', 'b', 'b', 'a', 'a', 'a'),
date = c('2019-01-01', '2019-01-02','2019-01-03', '2019-01-02', '2019-01-03', '2019-01-01', '2019-01-02', '2019-01-03'),
value = c(1, 3, 4, 4, 5, 8, 9, 1))
Attempt 1:
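library(dplyr)
library(tsibble)
# the tsibble index must be a Date, not a character column
df <- mutate(df, date = as.Date(date))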
df %>%
# tsibble format
as_tsibble(key = c(ID, group), index = date) %>%
# group by
group_by(ID, group) %>%
# fill gaps
fill_gaps(.full = TRUE)
Attempt 2:
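library(tidyverse)
# full_seq() needs a Date (or numeric) column, not a character column
df <- mutate(df, date = as.Date(date))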
complete(df, expand(df, nesting(ID, group), date = full_seq(date,1)))
Both yield the same result. Note that group b of ID 2 is still missing:
# A tsibble: 9 x 4 [1D]
# Key: ID, group [3]
# Groups: ID, group [3]
ID group date value
<dbl> <chr> <date> <dbl>
1 1 a 2019-01-01 1
2 1 a 2019-01-02 3
3 1 a 2019-01-03 4
4 1 b 2019-01-01 NA
5 1 b 2019-01-02 4
6 1 b 2019-01-03 5
7 2 a 2019-01-01 8
8 2 a 2019-01-02 9
9 2 a 2019-01-03 1
I expect the output to be
# A tibble: 12 x 4
ID group date value
<dbl> <chr> <chr> <dbl>
1 1 a 2019-01-01 1
2 1 a 2019-01-02 3
3 1 a 2019-01-03 4
4 1 b 2019-01-01 NA
5 1 b 2019-01-02 4
6 1 b 2019-01-03 5
7 2 a 2019-01-01 8
8 2 a 2019-01-02 9
9 2 a 2019-01-03 1
10 2 b 2019-01-01 NA
11 2 b 2019-01-02 NA
12 2 b 2019-01-03 NA
where group b of ID 2, which never occurs in the dataset, appears in the result.
Note that my dataset has eight million rows and keeps growing every day. The code will be executed daily, so I am looking for a fast and convenient way of performing the task. But any idea or answer is welcome!
Actually, your second attempt is very close to what you expect, but you need to replace nesting() with crossing(). nesting() only keeps the combinations that appear in the data, whereas crossing() generates all possible combinations.
library(tidyr)
df <- tibble(ID = c(1, 1, 1, 1, 1, 2, 2, 2),
group = c('a', 'a', 'a', 'b', 'b', 'a', 'a', 'a'),
date = as.Date(c('2019-01-01', '2019-01-02','2019-01-03', '2019-01-02', '2019-01-03', '2019-01-01', '2019-01-02', '2019-01-03')),
value = c(1, 3, 4, 4, 5, 8, 9, 1))
complete(df, expand(df, crossing(ID, group), date = full_seq(date, 1)))
#> # A tibble: 12 x 4
#> ID group date value
#> <dbl> <chr> <date> <dbl>
#> 1 1 a 2019-01-01 1
#> 2 1 a 2019-01-02 3
#> 3 1 a 2019-01-03 4
#> 4 1 b 2019-01-01 NA
#> 5 1 b 2019-01-02 4
#> 6 1 b 2019-01-03 5
#> 7 2 a 2019-01-01 8
#> 8 2 a 2019-01-02 9
#> 9 2 a 2019-01-03 1
#> 10 2 b 2019-01-01 NA
#> 11 2 b 2019-01-02 NA
#> 12 2 b 2019-01-03 NA
Created on 2019-10-13 by the reprex package (v0.3.0)
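To make the difference concrete, here is a minimal sketch on the question's toy data. It assumes that listing columns separately in expand() or complete() crosses them, which is how tidyr documents it, so it is a shorter spelling of the same idea rather than a different method. The names observed, all_pairs and filled are only illustrative.

```r
library(tidyr)

df <- tibble(
  ID    = c(1, 1, 1, 1, 1, 2, 2, 2),
  group = c("a", "a", "a", "b", "b", "a", "a", "a"),
  date  = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-02",
                    "2019-01-03", "2019-01-01", "2019-01-02", "2019-01-03")),
  value = c(1, 3, 4, 4, 5, 8, 9, 1)
)

# nesting(): only the ID/group pairs that occur in the data (1a, 1b, 2a)
observed <- expand(df, nesting(ID, group))

# separate columns: every ID paired with every group, including the unseen 2b
all_pairs <- expand(df, ID, group)

# same crossing applied through complete(), with a full daily date sequence
filled <- complete(df, ID, group, date = full_seq(date, 1))

nrow(observed)
nrow(all_pairs)
nrow(filled)
```

On this data the three counts should be 3, 4 and 12; the extra rows in filled carry NA in value.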
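You could create a data frame containing the Cartesian product of the unique values of ID, group and date, then merge that with the original data frame to obtain the desired result. Because the inputs to the cross joins are unnamed vectors, merge() names the resulting columns x, y.x and y.y, which is why they appear in by.y below.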
zz <- merge(merge(unique(df$ID), unique(df$group), by = NULL, all = TRUE),
unique(df$date), by = NULL, all = TRUE)
merge(df, zz, by.x = c('ID','group', 'date'), by.y = c('x','y.x', 'y.y'), all = TRUE)
It's expensive, of course, if you have a large number of IDs, groups and dates. In that case you could work in stages: first compare the full set of ID/group combinations against your original data frame to find just the missing combinations, then merge against the set of dates, again adding only the rows that are missing.
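One possible reading of that staged approach, as a rough base R sketch on the question's toy data. The helper pair_key and the intermediate names are hypothetical, and whether this is actually faster than the single large merge on eight million rows is untested here.

```r
df <- data.frame(
  ID    = c(1, 1, 1, 1, 1, 2, 2, 2),
  group = c("a", "a", "a", "b", "b", "a", "a", "a"),
  date  = c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-02",
            "2019-01-03", "2019-01-01", "2019-01-02", "2019-01-03"),
  value = c(1, 3, 4, 4, 5, 8, 9, 1),
  stringsAsFactors = FALSE
)

# Stage 1: all ID/group pairs versus the pairs actually present
all_pairs  <- expand.grid(ID = unique(df$ID), group = unique(df$group),
                          stringsAsFactors = FALSE)
seen_pairs <- unique(df[c("ID", "group")])

# hypothetical helper: one string key per ID/group pair
pair_key <- function(d) paste(d$ID, d$group, sep = "|")
missing_pairs <- all_pairs[!pair_key(all_pairs) %in% pair_key(seen_pairs), ]

dates <- data.frame(date = unique(df$date), stringsAsFactors = FALSE)

# Stage 2: for pairs that do occur, add only the dates they lack
seen_grid   <- merge(seen_pairs, dates, by = NULL)   # cross join
filled_seen <- merge(df, seen_grid, by = c("ID", "group", "date"), all = TRUE)

# Stage 3: pairs that never occur get every date, with value = NA
missing_grid <- merge(missing_pairs, dates, by = NULL)
missing_grid$value <- rep(NA_real_, nrow(missing_grid))

result <- rbind(filled_seen, missing_grid)
result <- result[order(result$ID, result$group, result$date), ]
```

The large original data frame is only joined against the grid for pairs it already contains; the rows for entirely absent pairs are built separately and appended.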