I have a dataframe that contains data like below (tiny subset of data):
I'm trying to figure out a way to create a new dataframe that contains all rows that have the same values for carrier, flightnumber, departureAirport and arrivalAirport but also have date ranges that overlap.
By overlap I mean the effectiveDate for one row falls between the effectiveDate and discontinuedDate for another record that has the same values for the other columns I mentioned.
So in my above example, the first two rows would be considered an example of this (and should both be included in the new dataframe), but the third row is not.
I'm assuming I want to use groupby, but I'm not entirely clear on what aggregation function I would apply. Below is what I have so far:
df.groupby(['carrier','flightnumber','departureAirport','arrivalAirport'])[['effectiveDate', 'discontinuedDate']].min()
but obviously I need to apply a function that determines overlap instead of min(). How would I go about identifying overlap instead of returning the minimum values for this group?
UPDATE:
carrier flightnumber departureAirport arrivalAirport effectiveDate discontinuedDate
4U 9748 DUS GVA 2017-05-09 2017-07-12
4U 9748 DUS GVA 2017-05-14 2017-07-16
4U 9748 DUS GVA 2017-07-18 2017-08-27
AG 1234 SFO DFW 2017-03-09 2017-05-12
AG 1234 SFO DFW 2017-03-14 2017-05-16
UPDATE 2:
As far as output goes, I'd like to have any rows that overlap and have the same values for carrier, flightnumber, departureAirport and arrivalAirport returned in a new dataframe. There does not need to be any additional data included for these rows. So for the above example data, a dataframe like the one below would be my desired output:
carrier flightnumber departureAirport arrivalAirport effectiveDate discontinuedDate
4U 9748 DUS GVA 2017-05-09 2017-07-12
4U 9748 DUS GVA 2017-05-14 2017-07-16
AG 1234 SFO DFW 2017-03-09 2017-05-12
AG 1234 SFO DFW 2017-03-14 2017-05-16
Notice that only one record has been excluded (the third for 9748) - this is because its date range does not overlap with other records for the same flight.
With SUMPRODUCT we can check if each start date is less than any of the end dates in the table AND, if each end date is greater than any of the start dates in the table. If the dates on each row meets this criteria for more than one set of dates in the table, then we know there are overlapping dates.
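The same pairwise count can be sketched outside a spreadsheet. The following is a minimal NumPy illustration, not the SUMPRODUCT formula itself; the array names `starts`, `ends`, `hits`, `match_counts` and `has_overlap` are made up for this example, and the dates are the three 4U 9748 rows from the question.

```python
import numpy as np

# Sample ranges: the three 4U 9748 rows from the question.
starts = np.array(["2017-05-09", "2017-05-14", "2017-07-18"], dtype="datetime64[D]")
ends = np.array(["2017-07-12", "2017-07-16", "2017-08-27"], dtype="datetime64[D]")

# hits[r, c] is True when range r starts before range c ends
# AND range r ends after range c starts.
hits = (starts[:, None] < ends[None, :]) & (ends[:, None] > starts[None, :])

# Every range matches itself (as long as its start is before its end),
# so a count above 1 means it also matches at least one other range.
match_counts = hits.sum(axis=1)
has_overlap = match_counts > 1
```

Because every pair of rows is compared, this takes quadratic time and memory, so it is probably best suited to small groups.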
You can do this by swapping the ranges up front if necessary, so that the first range starts no later than the second. Then you can detect overlap if the second range's start is: less than or equal to the first range's end (if ranges are inclusive, containing both the start and end times); or less than the first range's end (if ranges are inclusive of start and exclusive of end).
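As a rough sketch of that two-range test in Python: the function name `ranges_overlap` and its `inclusive` flag are hypothetical, and each range is assumed to be a `(start, end)` tuple of comparable values such as `datetime.date`.

```python
from datetime import date


def ranges_overlap(first, second, inclusive=True):
    """Return True if two (start, end) ranges overlap."""
    # Swap up front so that `first` is the range that starts no later.
    if second[0] < first[0]:
        first, second = second, first
    if inclusive:
        # Both endpoints belong to the range, so touching ranges overlap.
        return second[0] <= first[1]
    # Start inclusive, end exclusive: touching ranges do not overlap.
    return second[0] < first[1]


row0 = (date(2017, 5, 9), date(2017, 7, 12))
row1 = (date(2017, 5, 14), date(2017, 7, 16))
row2 = (date(2017, 7, 18), date(2017, 8, 27))

first_two_overlap = ranges_overlap(row0, row1)
third_overlaps_second = ranges_overlap(row2, row1)
```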
data RESULTS;
  set EVENTS;
  OVERLAP = min(A2,B2) - max(A1,B1) + 1;
  if OVERLAP<0 then OVERLAP = 0;
run;
We can also zero out negative values of the variable x using the expression max(0, x), which gives the following formula for the date range overlap calculation: Overlap = max(0, min(A2, B2) - max(A1, B1) + 1).
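For comparison, the same overlap-length formula can be written in Python. This is only an illustrative sketch: the function name `overlap_days` is hypothetical, the arguments follow the A1/A2/B1/B2 naming of the formula, and both ranges are assumed to be inclusive of their end dates (hence the + 1).

```python
from datetime import date


def overlap_days(a1, a2, b1, b2):
    """Number of days shared by the inclusive ranges [a1, a2] and [b1, b2]."""
    # min of the ends minus max of the starts, plus 1 for inclusive ends;
    # max(0, ...) zeroes out the negative value produced by disjoint ranges.
    return max(0, (min(a2, b2) - max(a1, b1)).days + 1)


shared_first_two = overlap_days(
    date(2017, 5, 9), date(2017, 7, 12),
    date(2017, 5, 14), date(2017, 7, 16),
)
shared_second_third = overlap_days(
    date(2017, 5, 14), date(2017, 7, 16),
    date(2017, 7, 18), date(2017, 8, 27),
)
```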
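High Level Concept
Sort the dates, prioritizing effectiveDate if there is exact overlap.
Cumulatively sum alternating 1 and -1 in that sorted order, so each effectiveDate adds 1 and each discontinuedDate subtracts 1. A contiguous group ends when the sum drops to 0.
Split the row index at those points and keep the groups that contain more than 1 row.
Concatenate the kept groups and use .loc to get the sliced dataframe.

import numpy as np

def overlaping_groups(df):
    n = len(df)
    cols = ['effectiveDate', 'discontinuedDate']
    # Flatten the dates to [start0, end0, start1, end1, ...]
    v = np.column_stack([df[c].values for c in cols]).ravel()
    # +1 for every start date, -1 for every end date
    i = np.tile([1, -1], n)
    # Sort by date; on ties, starts (+1) come before ends (-1)
    a = np.lexsort([-i, v])
    # u is the inverse permutation of a
    u = np.empty_like(a)
    u[a] = np.arange(a.size)
    # Row positions whose end date brings the running sum back to 0
    e = np.flatnonzero(i[a].cumsum()[u][1::2] == 0)
    # Number of rows in each contiguous group
    d = np.diff(np.append(-1, e))
    s = np.split(df.index.values, e[:-1] + 1)
    keep = [g for j, g in enumerate(s) if d[j] > 1]
    # np.concatenate raises on an empty list, so return an empty frame instead
    return df.loc[np.concatenate(keep)] if keep else df.iloc[:0]

gcols = ['carrier', 'flightnumber', 'departureAirport', 'arrivalAirport']
df.groupby(gcols, group_keys=False).apply(overlaping_groups)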
carrier flightnumber departureAirport arrivalAirport effectiveDate discontinuedDate
0 4U 9748 DUS GVA 2017-05-09 2017-07-12
1 4U 9748 DUS GVA 2017-05-14 2017-07-16
3 AG 1234 SFO DFW 2017-03-09 2017-05-12
4 AG 1234 SFO DFW 2017-03-14 2017-05-16
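The NumPy function above reads the running sum at each row's discontinuedDate in the original row order, so it appears to assume that the rows of each flight are already sorted by effectiveDate and that no range sits entirely inside an earlier one. As a hedged alternative (a different technique, not the method in this answer), the sketch below sorts first and compares each effectiveDate with the running maximum of the earlier discontinuedDate values. The function name `overlapping_rows` and its parameters are hypothetical, and storing flightnumber as an integer is an assumption.

```python
import pandas as pd


def overlapping_rows(df, keys, start="effectiveDate", end="discontinuedDate"):
    """Return rows whose date range overlaps another row with the same keys."""
    # Order rows by key, then by start date, so overlapping rows are adjacent.
    ordered = df.sort_values(keys + [start, end])
    grouped = ordered.groupby(keys, sort=False)
    # Latest end date among the earlier rows of the same key group
    # (NaT for the first row of each group).
    prev_end = grouped[end].transform(lambda s: s.cummax().shift())
    # A row opens a new cluster when it is the first of its group or when it
    # starts after every earlier range has ended.
    new_cluster = prev_end.isna() | (ordered[start] > prev_end)
    cluster_id = new_cluster.cumsum()
    # Keep clusters with more than one row and restore the original order.
    cluster_size = cluster_id.map(cluster_id.value_counts())
    return ordered[cluster_size > 1].sort_index()


# Sample data copied from the question.
rows = [
    ("4U", 9748, "DUS", "GVA", "2017-05-09", "2017-07-12"),
    ("4U", 9748, "DUS", "GVA", "2017-05-14", "2017-07-16"),
    ("4U", 9748, "DUS", "GVA", "2017-07-18", "2017-08-27"),
    ("AG", 1234, "SFO", "DFW", "2017-03-09", "2017-05-12"),
    ("AG", 1234, "SFO", "DFW", "2017-03-14", "2017-05-16"),
]
gcols = ["carrier", "flightnumber", "departureAirport", "arrivalAirport"]
df = pd.DataFrame(rows, columns=gcols + ["effectiveDate", "discontinuedDate"])
for col in ("effectiveDate", "discontinuedDate"):
    df[col] = pd.to_datetime(df[col])

result = overlapping_rows(df, gcols)
```

With the strict `>` comparison, a range that starts on the very day an earlier one ends is treated as overlapping, which matches the "falls between" definition in the question; switching to `>=` would exclude that case.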