Given a data.frame that contains a time series and one or more grouping fields, we have several time series, one for each combination of grouping values. Some dates are missing. What's the easiest way (in the sense of the most "tidyverse" way) to add these dates with the right grouping values?
Normally I would generate a data.frame with all dates and full_join it with my time series. But here we have to do that for each combination of grouping values, and fill in the grouping values as well.
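For a single series without grouping columns, that usual approach could be sketched roughly as follows. The data frame ts and its values are made up for illustration only; they are not part of the example further down.

```r
library(dplyr)

# Hypothetical single time series in which 2017-01-02 is missing
ts <- data.frame(
  date = as.Date(c("2017-01-01", "2017-01-03")),
  v1   = c(0.2, 0.7)
)

# Every day between the first and the last observed date
all.dates <- data.frame(date = seq(min(ts$date), max(ts$date), by = "day"))

# Dates that are absent from ts come back with NA in v1
filled <- full_join(all.dates, ts, by = "date")
filled
```

With grouping columns this no longer works directly, because the rows added by the join have NA in the grouping columns too.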
Let's look at an example:
First I create a data.frame with missing values:
library(dplyr)
library(lubridate)
set.seed(1234)
# Time series should run from 2017-01-01 until 2017-01-10
date <- data.frame(date = seq.Date(from=ymd("2017-01-01"), to=ymd("2017-01-10"), by="days"), v = 1)
# Two grouping dimensions
d1 <- data.frame(d1 = c("A", "B", "C", "D"), v = 1)
d2 <- data.frame(d2 = c(1, 2, 3, 4, 5), v = 1)
# Generate the data.frame
df <- full_join(date, full_join(d1, d2)) %>%
select(date, d1, d2)
# and add two value columns
df$v1 <- runif(200)
df$v2 <- runif(200)
# group by the dimension columns
df <- df %>%
group_by(d1, d2)
# drop some rows to create missing dates
df.missing <- df %>%
filter(v1 <= 0.8)
# So 2017-01-01 and 2017-01-10 are now missing for A/5
df.missing %>%
filter(d1 == "A" & d2 == 5)
# A tibble: 8 x 5
# Groups: d1, d2 [1]
date d1 d2 v1 v2
<date> <fctr> <dbl> <dbl> <dbl>
1 2017-01-02 A 5 0.21879954 0.1335497
2 2017-01-03 A 5 0.32977018 0.9802127
3 2017-01-04 A 5 0.23902573 0.1206089
4 2017-01-05 A 5 0.19617465 0.7378315
5 2017-01-06 A 5 0.13373890 0.9493668
6 2017-01-07 A 5 0.48613541 0.3392834
7 2017-01-08 A 5 0.35698708 0.3696965
8 2017-01-09 A 5 0.08498474 0.8354756
So to add the missing dates I generate a data.frame with all dates:
start <- min(df.missing$date)
end <- max(df.missing$date)
all.dates <- data.frame(date=seq.Date(start, end, by="day"))
Now I want to do something like this (remember: df.missing is grouped by d1 and d2):
df.missing %>%
do(my_join(.))
So let's define my_join():
my_join <- function(data) {
# get value of both dimensions
d1.set <- data$d1[[1]]
d2.set <- data$d2[[1]]
tmp <- full_join(data, all.dates) %>%
# First we need to ungroup. Otherwise we can't change d1 and d2 because they are grouping variables
ungroup() %>%
mutate(
d1 = d1.set,
d2 = d2.set
) %>%
group_by(d1, d2)
return(tmp)
}
Now we can call my_join() for each combination and have a look at "A/5":
df.missing %>%
do(my_join(.)) %>%
filter(d1 == "A" & d2 == 5)
# A tibble: 10 x 5
# Groups: d1, d2 [1]
date d1 d2 v1 v2
<date> <fctr> <dbl> <dbl> <dbl>
1 2017-01-02 A 5 0.21879954 0.1335497
2 2017-01-03 A 5 0.32977018 0.9802127
3 2017-01-04 A 5 0.23902573 0.1206089
4 2017-01-05 A 5 0.19617465 0.7378315
5 2017-01-06 A 5 0.13373890 0.9493668
6 2017-01-07 A 5 0.48613541 0.3392834
7 2017-01-08 A 5 0.35698708 0.3696965
8 2017-01-09 A 5 0.08498474 0.8354756
9 2017-01-01 A 5 NA NA
10 2017-01-10 A 5 NA NA
Great! That's what we were looking for. But we have to set d1 and d2 by hand inside my_join(), and that feels a bit clumsy.
So, is there a more tidyverse-like way to do this?
P.S.: I've put the code into a gist: https://gist.github.com/JerryWho/1bf919ef73792569eb38f6462c6d7a8e
tidyr has some great tools for these sorts of problems. Take a look at complete().
library(dplyr)
library(tidyr)
library(lubridate)
want <- df.missing %>%
ungroup() %>%
complete(nesting(d1, d2), date = seq(min(date), max(date), by = "day"))
want %>% filter(d1 == "A" & d2 == 5)
#> # A tibble: 10 x 5
#> d1 d2 date v1 v2
#> <fctr> <dbl> <date> <dbl> <dbl>
#> 1 A 5 2017-01-01 NA NA
#> 2 A 5 2017-01-02 0.21879954 0.1335497
#> 3 A 5 2017-01-03 0.32977018 0.9802127
#> 4 A 5 2017-01-04 0.23902573 0.1206089
#> 5 A 5 2017-01-05 0.19617465 0.7378315
#> 6 A 5 2017-01-06 0.13373890 0.9493668
#> 7 A 5 2017-01-07 0.48613541 0.3392834
#> 8 A 5 2017-01-08 0.35698708 0.3696965
#> 9 A 5 2017-01-09 0.08498474 0.8354756
#> 10 A 5 2017-01-10 NA NA
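As a rough illustration of why nesting() is used above, here is a small sketch. The toy data and the names toy, all_days, nested and crossed are made up for this example and do not come from the question. nesting(d1, d2) keeps only the d1/d2 pairs that actually occur in the data, whereas listing d1 and d2 separately expands to every possible combination.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two groups (A/1 and B/2), each missing one of three days
toy <- tibble(
  d1   = c("A", "A", "B", "B"),
  d2   = c(1, 1, 2, 2),
  date = as.Date(c("2017-01-01", "2017-01-03", "2017-01-02", "2017-01-03")),
  v1   = c(0.1, 0.3, 0.5, 0.6)
)

all_days <- seq(min(toy$date), max(toy$date), by = "day")

# nesting(): only the observed (d1, d2) pairs, each crossed with every day
nested <- toy %>% complete(nesting(d1, d2), date = all_days)

# no nesting(): every d1 x d2 combination, including A/2 and B/1,
# which never appear in the data
crossed <- toy %>% complete(d1, d2, date = all_days)

nrow(nested)   # 2 pairs x 3 days
nrow(crossed)  # 4 pairs x 3 days
```

In the question's data every d1/d2 combination is present, so both variants should give the same rows there; nesting() matters once some combinations do not exist at all.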