Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Overlap in date range grouped dataframe

How can i know if two rows have overlap in their date range??

Input Dataframe:

A B Start End Timestamp
A1 B1 2022-01-15 2022-02-15 2021-05-17
A1 B1 2021-07-15 2021-10-17 2021-05-17
A1 B1 2021-07-30 2021-10-02 2021-05-16
A1 B2 2022-01-01 2023-01-01 2021-05-17
A1 B2 2021-06-02 2021-06-04 2021-05-16
A2 B3 2021-05-10 2021-05-12 2021-05-17
A2 B3 2021-04-10 2021-06-12 2021-05-16
A2 B4 2021-06-02 2021-06-04 2021-05-17

I want to know how I could identify if there is overlap between two rows of the same group (A and B). For example in rows 0, 1 and 2 they belong to the same group (A1 and B1) but there is only overlap between rows 1 and 2 since they share a part of the date range and therefore what I want is to keep the row that has the highest timestamp if there is overlap. But for example in rows 3 and 4, although they belong to the same group (A1 and B2) since there is no overlap between their dates, I want to keep both dates. How could it be done?

Expected Dataframe:

A B Start End Timestamp
A1 B1 2022-01-15 2022-02-15 2021-05-17
A1 B1 2021-07-15 2021-10-17 2021-05-17
A1 B2 2022-01-01 2023-01-01 2021-05-17
A1 B2 2021-06-02 2021-06-04 2021-05-16
A2 B3 2021-05-10 2021-05-12 2021-05-17
A2 B4 2021-06-02 2021-06-04 2021-05-17
like image 912
JDK Avatar asked Dec 16 '25 14:12

JDK


1 Answers

You can use:

  1. Optionally convert dates to datetime if not already in datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
  1. Sort values to facilitate date comparison between consecutive rows
df = df.sort_values(['A', 'B', 'Start', 'End'])
  1. Set group number within the same A and B for non-overlapping date ranges: By comparing Start date in a sorted row is larger than End date of previous row (this will be non-overlapping date range). All group numbers are relative to within same groups of A and B.
group = (df['Start'] >  df.groupby(['A', 'B'])['End'].shift()).groupby([df['A'], df['B']]).cumsum()
  1. Group by A and B and overlapping group number (by newly created group) and get the index for highest Timestamp by idxmax(). Use .loc to filter for such rows (with highest Timestamp in groups) to retain. Finally .sort_index() to restore the sequence of original dataframe before sorting.
df.loc[df.groupby([df['A'], df['B'], group])['Timestamp'].idxmax()].sort_index()

Result:

    A   B      Start        End  Timestamp
0  A1  B1 2022-01-15 2022-02-15 2021-05-17
1  A1  B1 2021-07-15 2021-10-17 2021-05-17
3  A1  B2 2022-01-01 2023-01-01 2021-05-17
4  A1  B2 2021-06-02 2021-06-04 2021-05-16
5  A2  B3 2021-05-10 2021-05-12 2021-05-17
7  A2  B4 2021-06-02 2021-06-04 2021-05-17
like image 138
SeaBean Avatar answered Dec 19 '25 02:12

SeaBean



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!