Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding the number of consecutive days in data [duplicate]

I am working with a dataframe where each row is indexed by a date. These are ordered but incomplete, meaning that that there is a subset of uninterrupted sequential rows in the dataset. I am interested in grouping the rows that belong to a sequence and finding out the length (number of days) of each sequence. Rows that are not part of a sequence are disregarded (or could have length of 1). Consider the following toy data:

library(tidyverse)
library(lubridate)

# toy data
df <- tibble(day = as_date(c("2022-01-01", "2022-01-03", "2022-01-04", 
                             "2022-01-05", "2022-01-08","2022-01-10",
                             "2022-01-11", "2022-01-12","2022-01-13")),
             values = 1:9)

What I am trying to obtain is an aggregated dataframe that tells me each uninterrupted sequence of days in the data and their length such as this:

# ideal output
tibble(sequences = c("2022-01-03 - 2022-01-05", "2022-01-10 - 2022-01-12"),
       length = c(3, 4))

So far I have calculated the differences in days between each row by using lags and leads and then turned this into an additional column indicating whether a row is the start, mid or endpoint of an uninterrupted sequence of days:

# transform df
df %>% 
  # add lag and lead to find day differences
  mutate(last_day = lag(day), 
         next_day = lead(day),
         diff_last = day - last_day,
         diff_next = next_day - day) %>% 
  # flag days in sequence
  mutate(in_seq = if_else(
    condition = diff_last == 1| diff_next == 1, 
    true = TRUE, 
    false = FALSE)) %>% 
  # find position of days in sequence
  mutate(seq_position = case_when(
    diff_last > 1 & diff_next == 1 ~ "start",
    diff_last == 1 & diff_next > 1 ~ "end",
    diff_last == 1 & diff_next == 1 ~ "mid",
    is.na(diff_last) & diff_next == 1 ~ "start",
    diff_last == 1 & is.na(diff_next) ~ "end",
    is.na(diff_last) & is.na(diff_next) ~ NA_character_,
    TRUE ~ "none"
  ))

This is where I'm stuck. I suspect that I need to create an additional variable for grouping that contains the start and end of each sequence so that I can summarise the number of rows per group, but I do not know a way for creating such groups since it involves creating a string based on values from multiple columns.

like image 398
O René Avatar asked Oct 29 '25 19:10

O René


1 Answers

You could do:

df %>% 
  group_by(cumsum(c(0, diff(day) - 1))) %>%
  summarise(sequences = paste(first(day), last(day), sep = ' - '),
            length    = n()) %>%
  filter(length > 1) %>%
  select(sequences, length)

#> # A tibble: 2 x 2
#>   sequences               length
#>   <chr>                    <int>
#> 1 2022-01-03 - 2022-01-05      3
#> 2 2022-01-10 - 2022-01-13      4
like image 80
Allan Cameron Avatar answered Nov 01 '25 09:11

Allan Cameron