Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filling missing dates in a grouped time series - a tidyverse-way?

Given a data.frame that contains a time series and one or ore grouping fields. So we have several time series - one for each grouping combination. But some dates are missing. So, what's the easiest (in terms of the most "tidyverse way") of adding these dates with the right grouping values?

Normally I would say I generate a data.frame with all dates and do a full_join with my time series. But now we have to do it for each combination of grouping values -- and fill in the grouping values.

Let's look at an example:

First I create a data.frame with missing values:

library(dplyr)
library(lubridate)

set.seed(1234)
# Time series should run vom 2017-01-01 til 2017-01-10
date <- data.frame(date = seq.Date(from=ymd("2017-01-01"), to=ymd("2017-01-10"), by="days"), v = 1)
# Two grouping dimensions
d1   <- data.frame(d1 = c("A", "B", "C", "D"), v = 1)
d2   <- data.frame(d2 = c(1, 2, 3, 4, 5), v = 1)

# Generate the data.frame
df <- full_join(date, full_join(d1, d2)) %>%
  select(date, d1, d2) 
# and ad to value columns
df$v1 <- runif(200)
df$v2 <- runif(200)

# group by the dimension columns
df <- df %>% 
  group_by(d1, d2)

# create missing dates
df.missing <- df %>%
  filter(v1 <= 0.8)

# So now  2017-01-01 and 2017-01-10, A, 5 are missing now
df.missing %>%
  filter(d1 == "A" & d2 == 5)

# A tibble: 8 x 5
# Groups:   d1, d2 [1]
        date     d1    d2         v1        v2
      <date> <fctr> <dbl>      <dbl>     <dbl>
1 2017-01-02      A     5 0.21879954 0.1335497
2 2017-01-03      A     5 0.32977018 0.9802127
3 2017-01-04      A     5 0.23902573 0.1206089
4 2017-01-05      A     5 0.19617465 0.7378315
5 2017-01-06      A     5 0.13373890 0.9493668
6 2017-01-07      A     5 0.48613541 0.3392834
7 2017-01-08      A     5 0.35698708 0.3696965
8 2017-01-09      A     5 0.08498474 0.8354756

So to add the missing dates I generate a data.frame with all dates:

start <- min(df.missing$date)
end   <- max(df.missing$date)

all.dates <- data.frame(date=seq.Date(start, end, by="day"))

No I want to do something like (remember: df.missing is group_by(d1, d2))

df.missing %>%
  do(my_join())

So let's define my_join():

my_join <- function(data) {
  # get value of both dimensions
  d1.set <- data$d1[[1]]
  d2.set <- data$d2[[1]]

  tmp <- full_join(data, all.dates) %>%
    # First we need to ungroup.  Otherwise we can't change d1 and d2 because they are grouping variables
    ungroup() %>%
    mutate(
      d1 = d1.set,
      d2 = d2.set 
    ) %>%
    group_by(d1, d2)

  return(tmp)
}

Now we can call my_join() for each combination and have a look at "A/5"

df.missing %>%
  do(my_join(.)) %>%
  filter(d1 == "A" & d2 == 5)

# A tibble: 10 x 5
# Groups:   d1, d2 [1]
         date     d1    d2         v1        v2
       <date> <fctr> <dbl>      <dbl>     <dbl>
 1 2017-01-02      A     5 0.21879954 0.1335497
 2 2017-01-03      A     5 0.32977018 0.9802127
 3 2017-01-04      A     5 0.23902573 0.1206089
 4 2017-01-05      A     5 0.19617465 0.7378315
 5 2017-01-06      A     5 0.13373890 0.9493668
 6 2017-01-07      A     5 0.48613541 0.3392834
 7 2017-01-08      A     5 0.35698708 0.3696965
 8 2017-01-09      A     5 0.08498474 0.8354756
 9 2017-01-01      A     5         NA        NA
10 2017-01-10      A     5         NA        NA

Great! That's what we were looking for. But we need to define d1 and d2 in my_join and it feels a little bit clumsy.

So, is there any tidyverse-way of this solution?

P.S.: I've put the code into a gist: https://gist.github.com/JerryWho/1bf919ef73792569eb38f6462c6d7a8e

like image 706
JerryWho Avatar asked Sep 09 '17 11:09

JerryWho


People also ask

Which function is used for filling NA value with consecutive values in R?

The fillna() function is used to fill NA/NaN values using the specified method.


1 Answers

tidyr has some great tools for these sorts of problems. Take a look at complete.


library(dplyr)
library(tidyr)
library(lubridate)

want <- df.missing %>% 
  ungroup() %>%
  complete(nesting(d1, d2), date = seq(min(date), max(date), by = "day"))

want %>% filter(d1 == "A" & d2 == 5) 

#> # A tibble: 10 x 5
#>        d1    d2       date         v1        v2
#>    <fctr> <dbl>     <date>      <dbl>     <dbl>
#>  1      A     5 2017-01-01         NA        NA
#>  2      A     5 2017-01-02 0.21879954 0.1335497
#>  3      A     5 2017-01-03 0.32977018 0.9802127
#>  4      A     5 2017-01-04 0.23902573 0.1206089
#>  5      A     5 2017-01-05 0.19617465 0.7378315
#>  6      A     5 2017-01-06 0.13373890 0.9493668
#>  7      A     5 2017-01-07 0.48613541 0.3392834
#>  8      A     5 2017-01-08 0.35698708 0.3696965
#>  9      A     5 2017-01-09 0.08498474 0.8354756
#> 10      A     5 2017-01-10         NA        NA
like image 119
austensen Avatar answered Sep 29 '22 10:09

austensen