R: Fill in Implicit Missing Values and Groups to the Entire Time Span of the Data

Question

Objective: Fill in implicit missing values for each ID and each group over the entire time span of the data, including groups that are entirely missing from the dataset.

Background: Each ID has several groups, and each group should have a value for every day. However, some groups are missing certain dates (such as ID 1, group b on 2019-01-01), and some groups are entirely missing from the dataset (such as ID 2, group b).

# A tibble: 8 x 4
     ID group date       value
  <dbl> <chr> <chr>      <dbl>
1     1 a     2019-01-01     1
2     1 a     2019-01-02     3
3     1 a     2019-01-03     4
4     1 b     2019-01-02     4
5     1 b     2019-01-03     5
6     2 a     2019-01-01     8
7     2 a     2019-01-02     9
8     2 a     2019-01-03     1

Code to generate the dataframe

library(tibble)
df <- tibble(ID = c(1, 1, 1, 1, 1, 2, 2, 2),
             group = c('a', 'a', 'a', 'b', 'b', 'a', 'a', 'a'),
             date = c('2019-01-01', '2019-01-02','2019-01-03', '2019-01-02', '2019-01-03', '2019-01-01', '2019-01-02', '2019-01-03'),
             value = c(1, 3, 4, 4, 5, 8, 9, 1))

Attempt 1:

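library(dplyr)
library(tsibble)
df %>%
  # convert date from character to Date so it can serve as the index
  mutate(date = as.Date(date)) %>%
  # tsibble format
  as_tsibble(key = c(ID, group), index = date) %>%
  # group by
  group_by(ID, group) %>%
  # fill gaps
  fill_gaps(.full = TRUE)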

Attempt 2:

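library(tidyverse)
# full_seq() needs Date or numeric input, so convert the character dates first
df_dates <- mutate(df, date = as.Date(date))
complete(df_dates, expand(df_dates, nesting(ID, group), date = full_seq(date, 1)))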

Both yield the same rows (the tsibble output is shown here); note that ID 2, group b is still missing:

# A tsibble: 9 x 4 [1D]
# Key:       ID, group [3]
# Groups:    ID, group [3]
     ID group date       value
  <dbl> <chr> <date>     <dbl>
1     1 a     2019-01-01     1
2     1 a     2019-01-02     3
3     1 a     2019-01-03     4
4     1 b     2019-01-01    NA
5     1 b     2019-01-02     4
6     1 b     2019-01-03     5
7     2 a     2019-01-01     8
8     2 a     2019-01-02     9
9     2 a     2019-01-03     1

I expect the output to be

# A tibble: 12 x 4
      ID group date       value
   <dbl> <chr> <chr>      <dbl>
 1     1 a     2019-01-01     1
 2     1 a     2019-01-02     3
 3     1 a     2019-01-03     4
 4     1 b     2019-01-01    NA
 5     1 b     2019-01-02     4
 6     1 b     2019-01-03     5
 7     2 a     2019-01-01     8
 8     2 a     2019-01-02     9
 9     2 a     2019-01-03     1
10     2 b     2019-01-01    NA
11     2 b     2019-01-02    NA
12     2 b     2019-01-03    NA

where group b for ID 2, which is entirely absent from the dataset, appears in the result.

Note that my dataset has eight million rows and keeps growing every day. The code will be executed daily, so I am looking for a fast and convenient way of performing the task. But any idea or answer is welcome!

Earo Wang · Accepted Answer

Actually, your second attempt is very close to what you expect, but you need to replace nesting() with crossing(). nesting() only finds the combinations that appear in the data, whereas crossing() generates all possible combinations.

library(tidyr)
df <- tibble(ID = c(1, 1, 1, 1, 1, 2, 2, 2),
             group = c('a', 'a', 'a', 'b', 'b', 'a', 'a', 'a'),
             date = as.Date(c('2019-01-01', '2019-01-02','2019-01-03', '2019-01-02', '2019-01-03', '2019-01-01', '2019-01-02', '2019-01-03')),
             value = c(1, 3, 4, 4, 5, 8, 9, 1))
complete(df, expand(df, crossing(ID, group), date = full_seq(date, 1)))
#> # A tibble: 12 x 4
#>       ID group date       value
#>    <dbl> <chr> <date>     <dbl>
#>  1     1 a     2019-01-01     1
#>  2     1 a     2019-01-02     3
#>  3     1 a     2019-01-03     4
#>  4     1 b     2019-01-01    NA
#>  5     1 b     2019-01-02     4
#>  6     1 b     2019-01-03     5
#>  7     2 a     2019-01-01     8
#>  8     2 a     2019-01-02     9
#>  9     2 a     2019-01-03     1
#> 10     2 b     2019-01-01    NA
#> 11     2 b     2019-01-02    NA
#> 12     2 b     2019-01-03    NA

Created on 2019-10-13 by the reprex package (v0.3.0)
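
To see the difference between the two helpers directly, the sketch below counts the ID/group pairs each one produces on the example data. It also tries a shorter call that should be equivalent here: bare column names passed to complete() are crossed with each other, so the nested expand() is not strictly needed. The object names (observed_pairs, all_pairs, filled) are made up for this illustration, and nothing here says anything about speed on eight million rows.

```r
library(tidyr)

# Same toy data as above, with date stored as Date
df <- tibble(
  ID    = c(1, 1, 1, 1, 1, 2, 2, 2),
  group = c("a", "a", "a", "b", "b", "a", "a", "a"),
  date  = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03",
                    "2019-01-02", "2019-01-03",
                    "2019-01-01", "2019-01-02", "2019-01-03")),
  value = c(1, 3, 4, 4, 5, 8, 9, 1)
)

# nesting() keeps only the ID/group pairs that occur in the data
observed_pairs <- expand(df, nesting(ID, group))

# crossing() builds every ID x group pair, including the unobserved (2, b)
all_pairs <- expand(df, crossing(ID, group))

# Bare column names in complete() are crossed as well
filled <- complete(df, ID, group, date = full_seq(date, 1))

nrow(observed_pairs)
nrow(all_pairs)
nrow(filled)
```

On this data the nesting() version has one pair fewer than the crossing() version, which is exactly the missing ID 2, group b.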

user2474226 · Answer

You could create a dataframe containing the Cartesian product of the unique values of ID, group and date, then merge that with the original dataframe to obtain the desired result.

zz <- merge(merge(unique(df$ID), unique(df$group), by = NULL, all = TRUE), 
            unique(df$date), by = NULL, all = TRUE)
merge(df, zz, by.x = c('ID','group', 'date'), by.y = c('x','y.x', 'y.y'), all = TRUE)

It's expensive, of course, if you have a large number of IDs, groups and dates. In that case, you could first merge the ID/group combinations against your original dataframe to obtain just the missing combinations, then merge those against the set of dates, again picking up only the missing combinations, if you catch my drift.
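
One loose reading of that idea, sketched with dplyr and tidyr rather than merge(): build the key grid, keep only the keys that are absent from the data with anti_join(), and append those rows instead of merging the whole dataframe. This is an assumption about what the suggestion amounts to, not the answerer's own code, and the names all_dates, grid, missing_rows and filled are hypothetical. Whether it is actually faster on large data would need to be measured.

```r
library(dplyr)
library(tidyr)

# Same toy data as in the question, with date stored as Date
df <- tibble(
  ID    = c(1, 1, 1, 1, 1, 2, 2, 2),
  group = c("a", "a", "a", "b", "b", "a", "a", "a"),
  date  = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03",
                    "2019-01-02", "2019-01-03",
                    "2019-01-01", "2019-01-02", "2019-01-03")),
  value = c(1, 3, 4, 4, 5, 8, 9, 1)
)

# Every day between the first and last date in the data
all_dates <- seq(min(df$date), max(df$date), by = "day")

# Key grid: every ID x group x date combination (keys only, no values)
grid <- crossing(ID = unique(df$ID), group = unique(df$group), date = all_dates)

# Keep only the key combinations that do not occur in the data
missing_rows <- anti_join(grid, df, by = c("ID", "group", "date"))

# Append the missing keys; bind_rows() fills value with NA for them
filled <- bind_rows(df, missing_rows) %>%
  arrange(ID, group, date)
```

Here missing_rows holds only the four absent key combinations, so the original rows are never re-joined.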
