Overlap in date range grouped dataframe

Question

How can i know if two rows have overlap in their date range??

Input Dataframe:

A	B	Start	End	Timestamp
A1	B1	2022-01-15	2022-02-15	2021-05-17
A1	B1	2021-07-15	2021-10-17	2021-05-17
A1	B1	2021-07-30	2021-10-02	2021-05-16
A1	B2	2022-01-01	2023-01-01	2021-05-17
A1	B2	2021-06-02	2021-06-04	2021-05-16
A2	B3	2021-05-10	2021-05-12	2021-05-17
A2	B3	2021-04-10	2021-06-12	2021-05-16
A2	B4	2021-06-02	2021-06-04	2021-05-17

I want to know how I could identify if there is overlap between two rows of the same group (A and B). For example in rows 0, 1 and 2 they belong to the same group (A1 and B1) but there is only overlap between rows 1 and 2 since they share a part of the date range and therefore what I want is to keep the row that has the highest timestamp if there is overlap. But for example in rows 3 and 4, although they belong to the same group (A1 and B2) since there is no overlap between their dates, I want to keep both dates. How could it be done?

Expected Dataframe:

A	B	Start	End	Timestamp
A1	B1	2022-01-15	2022-02-15	2021-05-17
A1	B1	2021-07-15	2021-10-17	2021-05-17
A1	B2	2022-01-01	2023-01-01	2021-05-17
A1	B2	2021-06-02	2021-06-04	2021-05-16
A2	B3	2021-05-10	2021-05-12	2021-05-17
A2	B4	2021-06-02	2021-06-04	2021-05-17

SeaBean · Accepted Answer

You can use:

Optionally convert dates to datetime if not already in datetime

df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

Sort values to facilitate date comparison between consecutive rows

df = df.sort_values(['A', 'B', 'Start', 'End'])

Set group number within the same A and B for non-overlapping date ranges: By comparing Start date in a sorted row is larger than End date of previous row (this will be non-overlapping date range). All group numbers are relative to within same groups of A and B.

group = (df['Start'] >  df.groupby(['A', 'B'])['End'].shift()).groupby([df['A'], df['B']]).cumsum()

Group by A and B and overlapping group number (by newly created group) and get the index for highest Timestamp by idxmax(). Use .loc to filter for such rows (with highest Timestamp in groups) to retain. Finally .sort_index() to restore the sequence of original dataframe before sorting.

df.loc[df.groupby([df['A'], df['B'], group])['Timestamp'].idxmax()].sort_index()

Result:

    A   B      Start        End  Timestamp
0  A1  B1 2022-01-15 2022-02-15 2021-05-17
1  A1  B1 2021-07-15 2021-10-17 2021-05-17
3  A1  B2 2022-01-01 2023-01-01 2021-05-17
4  A1  B2 2021-06-02 2021-06-04 2021-05-16
5  A2  B3 2021-05-10 2021-05-12 2021-05-17
7  A2  B4 2021-06-02 2021-06-04 2021-05-17

Overlap in date range grouped dataframe

Tags:

python

date

pandas

dataframe

JDK

1 Answers

SeaBean

Recent Activity

Donate For Us

Overlap in date range grouped dataframe

Tags:

python

date

pandas

dataframe

JDK

1 Answers

SeaBean

Related questions

Recent Activity

Donate For Us