Collapse and merge overlapping time intervals

Tags:

datetime

dataframe

r

lubridate

tidyverse

I am developing a tidyverse-based data workflow, and came across a situation where I have a data frame with lots of time intervals. Let's call the data frame my_time_intervals, and it can be reproduced like this:

library(tidyverse)
library(lubridate)

my_time_intervals <- tribble(
    ~id, ~group, ~start_time, ~end_time,
    1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
    2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
    3L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
    4L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
    5L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
    6L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
    7L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
    8L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)

Here's a tibble view of the same data frame:

> my_time_intervals
# A tibble: 8 x 4
     id group start_time          end_time           
  <int> <int> <dttm>              <dttm>             
1     1     1 2018-04-12 11:15:03 2018-05-14 02:32:10
2     2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
3     3     1 2018-05-07 13:02:04 2018-05-23 08:13:06
4     4     2 2018-02-28 17:43:29 2018-04-20 03:48:40
5     5     2 2018-04-20 01:19:52 2018-08-12 12:56:37
6     6     2 2018-04-18 20:47:22 2018-04-19 16:07:29
7     7     2 2018-10-02 14:08:03 2018-11-08 00:01:23
8     8     3 2018-03-11 22:30:51 2018-10-20 21:01:42

A few notes about my_time_intervals:

The data is divided into three groups via the group variable.
The id variable is just a unique ID for each row in the data frame.
The start and end of time intervals are stored in start_time and end_time in lubridate form.
Some time intervals overlap, some don't, and they are not always in order. For example, row 1 overlaps with row 3, but neither of them overlaps with row 2.
More than two intervals may overlap with each other, and some intervals fall completely within others. See rows 4 through 6 in group == 2.
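
The overlaps described in these notes can be checked pairwise. This is only a minimal sketch, assuming lubridate's `interval()` and `int_overlaps()` helpers; the names `row1`, `row2`, and `row3` are made up here for illustration, and it does not do any merging:

```r
library(lubridate)

# Intervals for rows 1, 2 and 3 of my_time_intervals (all in group 1).
row1 <- interval(ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"))
row2 <- interval(ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"))
row3 <- interval(ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"))

# Row 1 overlaps row 3, but neither overlaps row 2.
overlap_1_3 <- int_overlaps(row1, row3)
overlap_1_2 <- int_overlaps(row1, row2)
overlap_3_2 <- int_overlaps(row3, row2)
```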

What I want is to collapse, within each group, any overlapping time intervals into contiguous intervals. In this case, my desired result would look like:

# A tibble: 5 x 4
     id group start_time          end_time           
  <int> <int> <dttm>              <dttm>             
1     1     1 2018-04-12 11:15:03 2018-05-23 08:13:06
2     2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
3     4     2 2018-02-28 17:43:29 2018-08-12 12:56:37
4     7     2 2018-10-02 14:08:03 2018-11-08 00:01:23
5     8     3 2018-03-11 22:30:51 2018-10-20 21:01:42

Notice that time intervals that overlap between different groups are not merged. Also, I don't care about what happens to the id column at this point.

I know that the lubridate package includes interval-related functions, but I can't figure out how to apply them to this use case.

How can I achieve this?

399

asked Nov 08 '18 17:11

hpy

3 Answers

my_time_intervals %>% 
  group_by(group) %>% arrange(start_time, .by_group = TRUE) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time), 
            end_time = max(end_time)) %>%
  select(-indx)


# # A tibble: 5 x 3
# # Groups:   group [3]
# group start_time          end_time           
# <int> <dttm>              <dttm>             
# 1     1 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     2 2018-02-28 17:43:29 2018-08-12 12:56:37
# 4     2 2018-10-02 14:08:03 2018-11-08 00:01:23
# 5     3 2018-03-11 22:30:51 2018-10-20 21:01:42

Explanation per OP's request:

To make the logic easier to follow, here is another dataset with more overlapping intervals within each group:

my_time_intervals <- tribble(
  ~id, ~group, ~start_time, ~end_time,
  1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
  2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
  3L, 1L, ymd_hms("2018-07-05 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
  4L, 1L, ymd_hms("2018-07-15 02:53:20"), ymd_hms("2018-07-16 18:09:01"),
  5L, 1L, ymd_hms("2018-07-15 01:53:20"), ymd_hms("2018-07-19 18:09:01"),
  6L, 1L, ymd_hms("2018-07-20 02:53:20"), ymd_hms("2018-07-22 18:09:01"),
  7L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
  8L, 1L, ymd_hms("2018-05-10 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
  9L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
  10L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
  11L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
  12L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
  13L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)

So let's look at the indx column for this dataset. I add group to arrange() so that rows of the same group are displayed together; because we already have group_by(group), the calculation itself does not need it.

my_time_intervals %>% 
  group_by(group) %>% arrange(group,start_time) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()]))


# # A tibble: 13 x 5
# # Groups:   group [3]
#       id group start_time          end_time             indx
#    <int> <int> <dttm>              <dttm>              <dbl>
#  1     1     1 2018-04-12 11:15:03 2018-05-14 02:32:10     0
#  2     7     1 2018-05-07 13:02:04 2018-05-23 08:13:06     0
#  3     8     1 2018-05-10 13:02:04 2018-05-23 08:13:06     0
#  4     2     1 2018-07-04 02:53:20 2018-07-14 18:09:01     1
#  5     3     1 2018-07-05 02:53:20 2018-07-14 18:09:01     1
#  6     5     1 2018-07-15 01:53:20 2018-07-19 18:09:01     2
#  7     4     1 2018-07-15 02:53:20 2018-07-16 18:09:01     2
#  8     6     1 2018-07-20 02:53:20 2018-07-22 18:09:01     3
#  9     9     2 2018-02-28 17:43:29 2018-04-20 03:48:40     0
# 10    11     2 2018-04-18 20:47:22 2018-04-19 16:07:29     0
# 11    10     2 2018-04-20 01:19:52 2018-08-12 12:56:37     0
# 12    12     2 2018-10-02 14:08:03 2018-11-08 00:01:23     1
# 13    13     3 2018-03-11 22:30:51 2018-10-20 21:01:42     0

As you can see, group 1 has three distinct periods made up of overlapping data points, plus one data point that overlaps nothing else in that group. The indx column divides those data points into 4 sets (i.e. 0, 1, 2, 3). Later in the solution, group_by(indx, group) puts each set of overlapping rows together, and we take the earliest start time and the latest end time to build the desired output.

To make the solution more robust (in case one data point starts earlier but ends later than all the others in the same group and indx, as in rows 6 and 7 of the output above, which have id 5 and id 4), I changed first() and last() to min() and max().
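
To see why this matters, here is a small base R sketch using the two rows with id 5 and id 4 from the data above. The data frame name `cluster` and the UTC time zone are assumptions made for this illustration only:

```r
# Rows with id 5 and id 4, already sorted by start_time (time zone assumed UTC).
cluster <- data.frame(
  id = c(5L, 4L),
  start_time = as.POSIXct(c("2018-07-15 01:53:20", "2018-07-15 02:53:20"), tz = "UTC"),
  end_time   = as.POSIXct(c("2018-07-19 18:09:01", "2018-07-16 18:09:01"), tz = "UTC")
)

# End of the last row: this would cut the merged interval short,
# because id 4 lies completely inside id 5.
last_end <- cluster$end_time[nrow(cluster)]

# Latest end among all rows: this covers every row in the set.
max_end <- max(cluster$end_time)
```

Here `last_end` is the 2018-07-16 end time of id 4, while `max_end` is the 2018-07-19 end time of id 5, which is the one the merged interval needs.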

So...

my_time_intervals %>% 
  group_by(group) %>% arrange(group,start_time) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time), end_time = max(end_time)) 


# # A tibble: 7 x 4
# # Groups:   group [?]
# group  indx start_time          end_time           
# <int> <dbl> <dttm>              <dttm>             
# 1     1     0 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     1     2 2018-07-15 01:53:20 2018-07-19 18:09:01
# 4     1     3 2018-07-20 02:53:20 2018-07-22 18:09:01
# 5     2     0 2018-02-28 17:43:29 2018-08-12 12:56:37
# 6     2     1 2018-10-02 14:08:03 2018-11-08 00:01:23
# 7     3     0 2018-03-11 22:30:51 2018-10-20 21:01:42

The unique index of each set of overlapping intervals is what lets us get the period (start and end) for each of them.

Beyond this point, you need to read about cumsum and cummax, and look at the output of these two functions for this specific problem, to understand why the comparison I made ends up giving a unique identifier to each set of overlapping intervals.
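
As a rough illustration of that comparison, here is a base R sketch on toy numbers instead of date-times. The vectors `start` and `end` and the helper names are made up for this example, and `c(start[-1], NA)` stands in for dplyr's `lead()`:

```r
# Toy intervals, already sorted by start.
start <- c(1, 2, 6, 12, 13)
end   <- c(10, 3, 8, 14, 20)

# Latest end seen so far, i.e. the right edge of everything up to this row.
running_end <- cummax(end)

# Start of the next row (NA for the last row).
next_start <- c(start[-1], NA)

# TRUE where the next row starts after everything seen so far has ended.
gap_before_next <- next_start > running_end

# Shift down by one row and accumulate: every gap bumps the index by 1.
indx <- c(0, cumsum(gap_before_next)[-length(start)])

# For contrast: comparing with the row's own end only (no cummax)
# splits the nested interval [6, 8] away from [1, 10].
naive_indx <- c(0, cumsum(next_start > end)[-length(start)])

data.frame(start, end, running_end, gap_before_next, indx, naive_indx)
```

With cummax the first three rows share index 0 and the last two share index 1; without it, the third row is wrongly pushed into its own set.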

Hope this helps, as it is my best.

135

answered Oct 07 '22 10:10

M--

Another tidyverse method, which compares each interval only with the one that follows it:

library(tidyverse)
library(lubridate)

my_time_intervals %>%
  arrange(group, start_time) %>%
  group_by(group) %>%
  mutate(new_end_time = if_else(end_time >= lead(start_time), lead(end_time), end_time),
         g = new_end_time != end_time | is.na(new_end_time),
         end_time = if_else(end_time != new_end_time & !is.na(new_end_time), new_end_time, end_time)) %>%
  filter(g) %>%
  select(-new_end_time, -g)

answered Oct 07 '22 10:10

acylam

We could sort by start_time, then nest and use reduce on the sub-tables to merge rows when relevant (using Masoud's data as df):

library(tidyverse)
df %>% 
  arrange(start_time) %>%
  select(-id) %>%
  nest(start_time, end_time,.key="startend") %>%
  mutate(startend = map(startend,~reduce(
    seq(nrow(.))[-1],
    ~ if(..3[.y,1] <= .x[nrow(.x),2]) 
        if(..3[.y,2] > .x[nrow(.x),2]) `[<-`(.x, nrow(.x), 2, value = ..3[.y,2])
        else .x
      else bind_rows(.x,..3[.y,]),
    .init = .[1,],
    .))) %>%
  arrange(group) %>%
  unnest()

# # A tibble: 7 x 3
# group          start_time            end_time
# <int>              <dttm>              <dttm>
# 1     1 2018-04-12 13:15:03 2018-05-23 10:13:06
# 2     1 2018-07-04 04:53:20 2018-07-14 20:09:01
# 3     1 2018-07-15 03:53:20 2018-07-19 20:09:01
# 4     1 2018-07-20 04:53:20 2018-07-22 20:09:01
# 5     2 2018-02-28 18:43:29 2018-08-12 14:56:37
# 6     2 2018-10-02 16:08:03 2018-11-08 01:01:23
# 7     3 2018-03-11 23:30:51 2018-10-20 23:01:42
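
The same fold can be written as an explicit loop, which may be easier to read. This is only a sketch in base R on toy numeric data, not the code above: `merge_intervals` and `toy` are hypothetical names, and the function assumes each group has at least one row.

```r
# Merge the overlapping intervals of ONE group.
# `d` needs columns start_time and end_time and at least one row.
merge_intervals <- function(d) {
  d <- d[order(d$start_time), ]
  out <- d[1, , drop = FALSE]          # result so far, starts with the first row
  for (i in seq_len(nrow(d))[-1]) {
    n <- nrow(out)
    if (d$start_time[i] <= out$end_time[n]) {
      # Overlap with the last merged interval: extend it if this row ends later.
      if (d$end_time[i] > out$end_time[n]) out$end_time[n] <- d$end_time[i]
    } else {
      # Gap: this row starts a new merged interval.
      out <- rbind(out, d[i, , drop = FALSE])
    }
  }
  out
}

# Toy data: two groups, unsorted, with a nested and a chained overlap.
toy <- data.frame(
  group      = c(1, 1, 1, 2, 2, 2, 2),
  start_time = c(1, 7, 4, 1, 2, 9, 20),
  end_time   = c(5, 9, 6, 10, 3, 12, 21)
)

# Apply per group and stack the results.
merged <- do.call(rbind, lapply(split(toy, toy$group), merge_intervals))
```

Like the reduce version, this uses `<=`, so intervals that merely touch are merged as well.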

answered Oct 07 '22 10:10

Moody_Mudskipper
