Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: dplyr group by date range

Tags:

r

group-by

dplyr

I'm trying to group a data frame by 3 date ranges based on "2016-04-10" and "2016-04-24".

df <- structure(list(date = structure(c(16803, 16810, 16817, 16824, 
16831, 16838, 16845, 16852, 16859, 16866, 16873, 16880, 16887, 
16894, 16901, 16908, 16915, 16922, 16929, 16936, 16943), class = "Date"), 
    new = c(1507L, 2851L, 3550L, 5329L, 7557L, 5546L, 6264L, 
    7160L, 9468L, 5789L, 5928L, 4642L, 8145L, 4867L, 4846L, 5231L, 
    7137L, 3938L, 3741L, 2937L, 194L), resolved = c(21, 27, 15, 
    16, 56, 2773, 8490, 8748, 9325, 7734, 10264, 6739, 6110, 
    9613, 10314, 10349, 7200, 9637, 10831, 11170, 5666), ost = c(1486, 
    2824, 3535, 5313, 7501, 2773, -2226, -1588, 143, -1945, -4336, 
    -2097, 2035, -4746, -5468, -5118, -63, -5699, -7090, -8233, 
    -5472)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-21L), .Names = c("date", "new", "resolved", "ost"))

Tried the following:

df1 <- df %>% group_by(dr=cut(date,breaks=as.Date(c("2016-04-10","2016-04-24")))) %>%
                summarise(ost = sum(ost))

Which gives the wrong result as below.

        dr    ost
2016-04-10 -10586
        NA -17885

Help appreciated!

like image 254
woshishui Avatar asked May 25 '16 06:05

woshishui


2 Answers

We create a grouping variable 'dr' with cut. The breaks mentioned are the range of the 'date' i.e. the min and max values of the 'date' along with the dates specified by the OP, concatenate it (c), use the option include.lowest and get the sum of 'ost' based on this grouping variable.

df %>%
  group_by(dr = cut(date, breaks = c(range(date), 
            as.Date(c("2016-04-10", "2016-04-24"))), include.lowest=TRUE) ) %>% 
  summarise(ost =sum(ost))
#         dr    ost
#     <fctr>  <dbl>
#1 2016-01-03   8672
#2 2016-04-10 -10586
#3 2016-04-24 -26557

Or another option is findInterval which could be faster compared to cut.

df %>%
  group_by(dr = findInterval(date, as.Date(c("2016-04-10", "2016-04-24")))) %>% 
  summarise(ost = sum(ost))
#     dr    ost
#  <int>  <dbl>
#1     0   8672
#2     1 -10586
#3     2 -26557

NOTE: The OP asked the question about cut and this solution gives that.

like image 128
akrun Avatar answered Oct 12 '22 05:10

akrun


You can create a grouping variable first,

df %>% 
mutate(group = cumsum(grepl('2016-04-10|2016-04-24', date))) %>%
group_by(group) %>% 
summarise(ost = sum(ost))

#Source: local data frame [3 x 2]

#  group    ost
#  (int)  (dbl)
#1     0   8672
#2     1 -10586
#3     2 -26557
like image 24
Sotos Avatar answered Oct 12 '22 03:10

Sotos