Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pad missing dates in a dataframe with several columns

Tags:

date

padding

r

How can I insert missing dates in a dataframe with more than 2 columns? In my data, each date has an observation between sp1 and sp2. If there is no observation between sp1 and sp2 on a day, that date is missing.

Here's what part of my df looks like:

the_date    sp1 sp2 win     loss    sp1_name    sp2_name
4/1/13      A   B   8       8         A_name    B_name
4/2/13      A   B   6       10        A_name    B_name
4/3/13      A   B   7       5         A_name    B_name
4/5/13      A   B   7       5         A_name    B_name
4/6/13      A   B   6       2         A_name    B_name
4/7/13      A   B   15      10        A_name    B_name
4/1/13      A   C   3       8         A_name    C_name
4/2/13      A   C   2       12        A_name    C_name
4/3/13      A   C   9       7         A_name    C_name
4/4/13      A   C   14      8         A_name    C_name
4/6/13      A   C   9       10        A_name    C_name
4/1/13      A   D   13      13        A_name    D_name
4/2/13      A   D   13      5         A_name    D_name
4/3/13      A   D   7       1         A_name    D_name
4/4/13      A   D   15      11        A_name    D_name
4/5/13      A   D   3       11        A_name    D_name
4/6/13      A   D   12      11        A_name    D_name
4/7/13      A   D   9       9         A_name    D_name

For instance, 4/4/13 for A-B is missing. What I want in my output is to insert those missing dates with all the corresponding columns and assign 0 to wins and losses. So my output will look like this the rows with * added):

the_date    sp1 sp2 win     loss    sp1_name    sp2_name
4/1/13      A   B   8       8         A_name    B_name
4/2/13      A   B   6       10        A_name    B_name
4/3/13      A   B   7       5         A_name    B_name
*4/4/13     A   B   0       0         A_name    B_name
4/5/13      A   B   7       5         A_name    B_name
4/6/13      A   B   6       2         A_name    B_name
4/7/13      A   B   15      10        A_name    B_name
4/1/13      A   C   3       8         A_name    C_name
4/2/13      A   C   2       12        A_name    C_name
4/3/13      A   C   9       7         A_name    C_name
4/4/13      A   C   14      8         A_name    C_name
*4/5/13     A   C   0       0         A_name    C_name
4/6/13      A   C   9       10        A_name    C_name
*4/7/13     A   C   0       0         A_name    C_name
4/1/13      A   D   13      13        A_name    D_name
4/2/13      A   D   13      5         A_name    D_name
4/3/13      A   D   7       1         A_name    D_name
4/4/13      A   D   15      11        A_name    D_name
4/5/13      A   D   3       11        A_name    D_name
4/6/13      A   D   12      11        A_name    D_name
4/7/13      A   D   9       9         A_name    D_name

I know that if we have a 2 column dataframe (values, date), we can pad the dataframe with missing dates by merging it with a full-range time. However, my dataframe has more than 2 columns.

Also, this is just part of my data, so I have other combinations for other dates:

sp1 sp2 
B    C
B    A
B    D
C    A
C    B
C    D
D    B
D    C
D    A

Any clues?

like image 990
ah25 Avatar asked Oct 27 '25 06:10

ah25


2 Answers

This is a dplyr approach. You may want to consider a data.table approach given that you have a large data set.

d <- read.table(textConnection("the_date    sp1 sp2 win     loss    sp1_name    sp2_name
4/1/13      A   B   8       8         A_name    B_name
4/2/13      A   B   6       10        A_name    B_name
4/3/13      A   B   7       5         A_name    B_name
4/5/13      A   B   7       5         A_name    B_name
4/6/13      A   B   6       2         A_name    B_name
4/7/13      A   B   15      10        A_name    B_name
4/1/13      A   C   3       8         A_name    C_name
4/2/13      A   C   2       12        A_name    C_name
4/3/13      A   C   9       7         A_name    C_name
4/4/13      A   C   14      8         A_name    C_name
4/6/13      A   C   9       10        A_name    C_name
4/1/13      A   D   13      13        A_name    D_name
4/2/13      A   D   13      5         A_name    D_name
4/3/13      A   D   7       1         A_name    D_name
4/4/13      A   D   15      11        A_name    D_name
4/5/13      A   D   3       11        A_name    D_name
4/6/13      A   D   12      11        A_name    D_name
4/7/13      A   D   9       9         A_name    D_name"),
stringsAsFactors = FALSE, header = TRUE)

d$the_date <- as.Date(d$the_date, "%m/%d/%y")

Update

I realize that my original answer below is not completely correct. For example, it did not pad group A C with a 4/7/13. With that in mind, I came up with a better and I think faster approach.

#Step one combine sp1 and sp2 into one group
d$group <- paste0(d$sp1,d$sp2)

#Step two find min and max date in the database

min_d <- min(d$the_date)
max_d <- max(d$the_date)

#Step three use dplyr
d %>%
  do(expand.grid(unique(.$group), seq(min_d, max_d, 1))) %>% 
  rename(group = Var1, the_date = Var2) %>%
  left_join(d) %>%
  arrange(group) %>%
  select(-group)

Original

#Step one combine sp1 and sp2 into one group
d$group <- paste0(d$sp1,d$sp2)

#Step two use dplyr.  
d %>%
  group_by(group) %>%
  summarise(min = min(the_date), max = max(the_date)) %>%
  rowwise() %>%
  do(data.frame(group = .$group, the_date = seq(.$min, .$max, 1))) %>%
  left_join(d) %>%
  select(-group)

In general your question is similar to this one. Check it out for more information/ideas.

like image 185
Jacob H Avatar answered Oct 29 '25 20:10

Jacob H


Here's a solution using pad and fill_by_value from padr:

library(dplyr)
library(tidyr)
library(padr)

df %>%
  mutate(the_date = as.Date(the_date, "%m/%d/%y")) %>%
  group_by(sp1, sp2) %>%
  pad() %>%              
  fill(sp1_name:sp2_name) %>%      
  fill_by_value(win, loss)

Result:

# A tibble: 20 x 7
# Groups:   sp1, sp2 [3]
     the_date    sp1    sp2   win  loss sp1_name sp2_name
       <date> <fctr> <fctr> <dbl> <dbl>   <fctr>   <fctr>
 1 2013-04-01      A      B     8     8   A_name   B_name
 2 2013-04-02      A      B     6    10   A_name   B_name
 3 2013-04-03      A      B     7     5   A_name   B_name
 4 2013-04-04      A      B     0     0   A_name   B_name
 5 2013-04-05      A      B     7     5   A_name   B_name
 6 2013-04-06      A      B     6     2   A_name   B_name
 7 2013-04-07      A      B    15    10   A_name   B_name
 8 2013-04-01      A      C     3     8   A_name   C_name
 9 2013-04-02      A      C     2    12   A_name   C_name
10 2013-04-03      A      C     9     7   A_name   C_name
11 2013-04-04      A      C    14     8   A_name   C_name
12 2013-04-05      A      C     0     0   A_name   C_name
13 2013-04-06      A      C     9    10   A_name   C_name
14 2013-04-01      A      D    13    13   A_name   D_name
15 2013-04-02      A      D    13     5   A_name   D_name
16 2013-04-03      A      D     7     1   A_name   D_name
17 2013-04-04      A      D    15    11   A_name   D_name
18 2013-04-05      A      D     3    11   A_name   D_name
19 2013-04-06      A      D    12    11   A_name   D_name
20 2013-04-07      A      D     9     9   A_name   D_name

Data:

df = structure(list(the_date = structure(c(1L, 2L, 3L, 5L, 6L, 7L, 
1L, 2L, 3L, 4L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = c("4/1/13", 
"4/2/13", "4/3/13", "4/4/13", "4/5/13", "4/6/13", "4/7/13"), class = "factor"), 
    sp1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "A", class = "factor"), 
    sp2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
    2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("B", "C", "D"
    ), class = "factor"), win = c(8L, 6L, 7L, 7L, 6L, 15L, 3L, 
    2L, 9L, 14L, 9L, 13L, 13L, 7L, 15L, 3L, 12L, 9L), loss = c(8L, 
    10L, 5L, 5L, 2L, 10L, 8L, 12L, 7L, 8L, 10L, 13L, 5L, 1L, 
    11L, 11L, 11L, 9L), sp1_name = structure(c(1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "A_name", class = "factor"), 
    sp2_name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
    2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("B_name", 
    "C_name", "D_name"), class = "factor")), .Names = c("the_date", 
"sp1", "sp2", "win", "loss", "sp1_name", "sp2_name"), class = "data.frame", row.names = c(NA, 
-18L))
like image 31
acylam Avatar answered Oct 29 '25 20:10

acylam



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!