How can i know if two rows have overlap in their date range??
Input Dataframe:
| A | B | Start | End | Timestamp |
|---|---|---|---|---|
| A1 | B1 | 2022-01-15 | 2022-02-15 | 2021-05-17 |
| A1 | B1 | 2021-07-15 | 2021-10-17 | 2021-05-17 |
| A1 | B1 | 2021-07-30 | 2021-10-02 | 2021-05-16 |
| A1 | B2 | 2022-01-01 | 2023-01-01 | 2021-05-17 |
| A1 | B2 | 2021-06-02 | 2021-06-04 | 2021-05-16 |
| A2 | B3 | 2021-05-10 | 2021-05-12 | 2021-05-17 |
| A2 | B3 | 2021-04-10 | 2021-06-12 | 2021-05-16 |
| A2 | B4 | 2021-06-02 | 2021-06-04 | 2021-05-17 |
I want to know how I could identify if there is overlap between two rows of the same group (A and B). For example in rows 0, 1 and 2 they belong to the same group (A1 and B1) but there is only overlap between rows 1 and 2 since they share a part of the date range and therefore what I want is to keep the row that has the highest timestamp if there is overlap. But for example in rows 3 and 4, although they belong to the same group (A1 and B2) since there is no overlap between their dates, I want to keep both dates. How could it be done?
Expected Dataframe:
| A | B | Start | End | Timestamp |
|---|---|---|---|---|
| A1 | B1 | 2022-01-15 | 2022-02-15 | 2021-05-17 |
| A1 | B1 | 2021-07-15 | 2021-10-17 | 2021-05-17 |
| A1 | B2 | 2022-01-01 | 2023-01-01 | 2021-05-17 |
| A1 | B2 | 2021-06-02 | 2021-06-04 | 2021-05-16 |
| A2 | B3 | 2021-05-10 | 2021-05-12 | 2021-05-17 |
| A2 | B4 | 2021-06-02 | 2021-06-04 | 2021-05-17 |
You can use:
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.sort_values(['A', 'B', 'Start', 'End'])
A and B for non-overlapping date ranges: By comparing Start date in a sorted row is larger than End date of previous row (this will be non-overlapping date range). All group numbers are relative to within same groups of A and B.group = (df['Start'] > df.groupby(['A', 'B'])['End'].shift()).groupby([df['A'], df['B']]).cumsum()
A and B and overlapping group number (by newly created group) and get the index for highest Timestamp by idxmax(). Use .loc to filter for such rows (with highest Timestamp in groups) to retain. Finally .sort_index() to restore the sequence of original dataframe before sorting.df.loc[df.groupby([df['A'], df['B'], group])['Timestamp'].idxmax()].sort_index()
Result:
A B Start End Timestamp
0 A1 B1 2022-01-15 2022-02-15 2021-05-17
1 A1 B1 2021-07-15 2021-10-17 2021-05-17
3 A1 B2 2022-01-01 2023-01-01 2021-05-17
4 A1 B2 2021-06-02 2021-06-04 2021-05-16
5 A2 B3 2021-05-10 2021-05-12 2021-05-17
7 A2 B4 2021-06-02 2021-06-04 2021-05-17
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With