
Filling missing dates in a grouped time series - a tidyverse-way?

Tags:

r

dplyr

time-series

tidyverse

I have a data.frame that contains a time series and one or more grouping fields, so there are really several time series, one for each grouping combination. But some dates are missing. What's the easiest (in the sense of the most "tidyverse") way of adding these dates with the right grouping values?

Normally I would generate a data.frame with all dates and do a full_join with my time series. But here we have to do that for each combination of grouping values, and fill in the grouping values as well.

Let's look at an example:

First I create a data.frame with missing values:

library(dplyr)
library(lubridate)

set.seed(1234)
# Time series should run from 2017-01-01 to 2017-01-10
date <- data.frame(date = seq.Date(from=ymd("2017-01-01"), to=ymd("2017-01-10"), by="days"), v = 1)
# Two grouping dimensions
d1   <- data.frame(d1 = c("A", "B", "C", "D"), v = 1)
d2   <- data.frame(d2 = c(1, 2, 3, 4, 5), v = 1)

# Generate the data.frame
df <- full_join(date, full_join(d1, d2)) %>%
  select(date, d1, d2) 
# and add two value columns
df$v1 <- runif(200)
df$v2 <- runif(200)

# group by the dimension columns
df <- df %>% 
  group_by(d1, d2)

# create missing dates
df.missing <- df %>%
  filter(v1 <= 0.8)

# Now 2017-01-01 and 2017-01-10 are missing for d1 == "A", d2 == 5
df.missing %>%
  filter(d1 == "A" & d2 == 5)

# A tibble: 8 x 5
# Groups:   d1, d2 [1]
        date     d1    d2         v1        v2
      <date> <fctr> <dbl>      <dbl>     <dbl>
1 2017-01-02      A     5 0.21879954 0.1335497
2 2017-01-03      A     5 0.32977018 0.9802127
3 2017-01-04      A     5 0.23902573 0.1206089
4 2017-01-05      A     5 0.19617465 0.7378315
5 2017-01-06      A     5 0.13373890 0.9493668
6 2017-01-07      A     5 0.48613541 0.3392834
7 2017-01-08      A     5 0.35698708 0.3696965
8 2017-01-09      A     5 0.08498474 0.8354756

So to add the missing dates I generate a data.frame with all dates:

start <- min(df.missing$date)
end   <- max(df.missing$date)

all.dates <- data.frame(date=seq.Date(start, end, by="day"))

Now I want to do something like this (remember: df.missing is grouped by d1 and d2):

df.missing %>%
  do(my_join(.))

So let's define my_join():

my_join <- function(data) {
  # get value of both dimensions
  d1.set <- data$d1[[1]]
  d2.set <- data$d2[[1]]

  tmp <- full_join(data, all.dates) %>%
    # First we need to ungroup.  Otherwise we can't change d1 and d2 because they are grouping variables
    ungroup() %>%
    mutate(
      d1 = d1.set,
      d2 = d2.set 
    ) %>%
    group_by(d1, d2)

  return(tmp)
}

Now we can call my_join() for each combination and have a look at "A/5":

df.missing %>%
  do(my_join(.)) %>%
  filter(d1 == "A" & d2 == 5)

# A tibble: 10 x 5
# Groups:   d1, d2 [1]
         date     d1    d2         v1        v2
       <date> <fctr> <dbl>      <dbl>     <dbl>
 1 2017-01-02      A     5 0.21879954 0.1335497
 2 2017-01-03      A     5 0.32977018 0.9802127
 3 2017-01-04      A     5 0.23902573 0.1206089
 4 2017-01-05      A     5 0.19617465 0.7378315
 5 2017-01-06      A     5 0.13373890 0.9493668
 6 2017-01-07      A     5 0.48613541 0.3392834
 7 2017-01-08      A     5 0.35698708 0.3696965
 8 2017-01-09      A     5 0.08498474 0.8354756
 9 2017-01-01      A     5         NA        NA
10 2017-01-10      A     5         NA        NA

Great! That's what we were looking for. But we need to set d1 and d2 by hand inside my_join, which feels a bit clumsy.

So, is there a more idiomatic tidyverse way to do this?

P.S.: I've put the code into a gist: https://gist.github.com/JerryWho/1bf919ef73792569eb38f6462c6d7a8e

706

asked Sep 09 '17 11:09

JerryWho

1 Answer

tidyr has some great tools for these sorts of problems. Take a look at complete.

library(dplyr)
library(tidyr)
library(lubridate)

want <- df.missing %>% 
  ungroup() %>%
  complete(nesting(d1, d2), date = seq(min(date), max(date), by = "day"))

want %>% filter(d1 == "A" & d2 == 5) 

#> # A tibble: 10 x 5
#>        d1    d2       date         v1        v2
#>    <fctr> <dbl>     <date>      <dbl>     <dbl>
#>  1      A     5 2017-01-01         NA        NA
#>  2      A     5 2017-01-02 0.21879954 0.1335497
#>  3      A     5 2017-01-03 0.32977018 0.9802127
#>  4      A     5 2017-01-04 0.23902573 0.1206089
#>  5      A     5 2017-01-05 0.19617465 0.7378315
#>  6      A     5 2017-01-06 0.13373890 0.9493668
#>  7      A     5 2017-01-07 0.48613541 0.3392834
#>  8      A     5 2017-01-08 0.35698708 0.3696965
#>  9      A     5 2017-01-09 0.08498474 0.8354756
#> 10      A     5 2017-01-10         NA        NA
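
To see what nesting() contributes in the call above, here is a minimal sketch on a small made-up tibble (toy, g, k and value are hypothetical names, not part of the question's data). It assumes a reasonably recent dplyr and tidyr. The tidyr documentation describes complete() as a wrapper around expand(), dplyr::full_join() and replace_na(), so the last step spells out roughly the same thing by hand, using a left_join since every original row already appears in the expanded grid.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two groups, each missing one day of a 3-day range
toy <- tibble(
  g     = c("a", "a", "b", "b"),
  k     = c(1, 1, 2, 2),
  date  = as.Date(c("2017-01-01", "2017-01-03", "2017-01-01", "2017-01-02")),
  value = c(10, 30, 1, 2)
)

# nesting(g, k) keeps only the (g, k) pairs that actually occur in the data
nested <- toy %>%
  complete(nesting(g, k), date = seq(min(date), max(date), by = "day"))

# Bare g, k would instead cross every g with every k (a/1, a/2, b/1, b/2)
crossed <- toy %>%
  complete(g, k, date = seq(min(date), max(date), by = "day"))

# Roughly the same as `nested`, written as expand() followed by a join
by_hand <- toy %>%
  expand(nesting(g, k), date = seq(min(date), max(date), by = "day")) %>%
  left_join(toy, by = c("g", "k", "date"))

nrow(nested)   # 2 observed pairs x 3 days
nrow(crossed)  # 2 x 2 possible pairs x 3 days
```

In the question's data every d1/d2 combination is present, so nesting(d1, d2) and plain d1, d2 would likely give the same rows there; nesting() starts to matter once some combinations never occur and should not be invented.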

119

answered Sep 29 '22 10:09

austensen
