
How to find rows with overlapping date ranges?

Tags:

python

pandas

I have a dataframe that contains data like the sample under UPDATE below (a tiny subset of the data).

I'm trying to find a way to create a new dataframe containing all rows that have the same values for carrier, flightnumber, departureAirport and arrivalAirport and whose date ranges also overlap.

By overlap I mean the effectiveDate for one row falls between the effectiveDate and discontinuedDate for another record that has the same values for the other columns I mentioned.
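
For two rows, this definition can be written as a single symmetric comparison. The sketch below is only an illustration: it assumes both endpoints are inclusive (the question does not say), and the function name ranges_overlap is made up for this example.

```python
from datetime import date


def ranges_overlap(start_a, end_a, start_b, end_b):
    """Return True if the inclusive ranges [start_a, end_a] and [start_b, end_b] share a day."""
    # Two ranges overlap exactly when each one starts before the other ends.
    return start_a <= end_b and start_b <= end_a


# The three 4U 9748 rows from the sample data.
first = (date(2017, 5, 9), date(2017, 7, 12))
second = (date(2017, 5, 14), date(2017, 7, 16))
third = (date(2017, 7, 18), date(2017, 8, 27))

print(ranges_overlap(*first, *second))  # True
print(ranges_overlap(*first, *third))   # False
```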

So in my above example, the first two rows would be considered an example of this (and should both be included in the new dataframe), but the third row is not.

I'm assuming I want to use groupby, but I'm not entirely clear on what aggregation function I would apply. Below is what I have so far:

df.groupby(['carrier','flightnumber','departureAirport','arrivalAirport'])[['effectiveDate', 'discontinuedDate']].min()

but obviously I need to apply a function that determines overlap instead of min(). How would I go about identifying overlap instead of returning the minimum values for this group?

UPDATE:

carrier flightnumber  departureAirport  arrivalAirport  effectiveDate discontinuedDate
4U      9748          DUS               GVA             2017-05-09    2017-07-12
4U      9748          DUS               GVA             2017-05-14    2017-07-16
4U      9748          DUS               GVA             2017-07-18    2017-08-27
AG      1234          SFO               DFW             2017-03-09    2017-05-12
AG      1234          SFO               DFW             2017-03-14    2017-05-16
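
To reproduce the table above, the rows can be loaded into a dataframe as follows. This is a sketch that assumes the two date columns should be parsed as datetimes rather than kept as strings.

```python
import pandas as pd

columns = ["carrier", "flightnumber", "departureAirport", "arrivalAirport",
           "effectiveDate", "discontinuedDate"]
rows = [
    ("4U", 9748, "DUS", "GVA", "2017-05-09", "2017-07-12"),
    ("4U", 9748, "DUS", "GVA", "2017-05-14", "2017-07-16"),
    ("4U", 9748, "DUS", "GVA", "2017-07-18", "2017-08-27"),
    ("AG", 1234, "SFO", "DFW", "2017-03-09", "2017-05-12"),
    ("AG", 1234, "SFO", "DFW", "2017-03-14", "2017-05-16"),
]
df = pd.DataFrame(rows, columns=columns)

# Parse the date strings so that comparisons and sorting work on real dates.
for col in ("effectiveDate", "discontinuedDate"):
    df[col] = pd.to_datetime(df[col])

print(df.dtypes)
```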

UPDATE 2:

As far as output goes I'd like to have any rows that overlap and have the same values for carrier, flightnumber, departureAirport and arrivalAirport returned in a new dataframe. There does not need to be any additional data included for these rows. So for the above example data, a dataframe like the one below would be my desired output:

carrier flightnumber  departureAirport  arrivalAirport  effectiveDate discontinuedDate
4U      9748          DUS               GVA             2017-05-09    2017-07-12
4U      9748          DUS               GVA             2017-05-14    2017-07-16
AG      1234          SFO               DFW             2017-03-09    2017-05-12
AG      1234          SFO               DFW             2017-03-14    2017-05-16

Notice that only one record has been excluded (the third for 9748). This is because its date range does not overlap with other records for the same flight.

Abe Miessler asked May 31 '17 15:05




1 Answer

High Level Concept

  • Put every effectiveDate and discontinuedDate into one array and sort it by date. When two dates are equal, place the effectiveDate first, so ranges that touch on the same day count as overlapping.
  • Before sorting, tag each effectiveDate with 1 and each discontinuedDate with -1, then take the cumulative sum of the tags in sorted order. The sum is the number of ranges open at that point: an overlap happens when it is above 1, and a contiguous group ends when it drops back to 0.
  • Undo the sort and find the rows whose discontinuedDate brings the sum to 0... these are the ends of the groups.
  • Split the dataframe index at these break points and keep only the splits that contain more than one row.
  • Concatenate the surviving splits and use loc to get the sliced dataframe.
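
To make the running count concrete, here is a minimal plain-Python sketch of the same idea on the three 4U 9748 rows from the question. It is an illustration only, not the vectorized NumPy version used in this answer, and the names events, open_ranges and groups are made up for the example.

```python
from datetime import date

ranges = [
    (date(2017, 5, 9), date(2017, 7, 12)),   # row 0
    (date(2017, 5, 14), date(2017, 7, 16)),  # row 1
    (date(2017, 7, 18), date(2017, 8, 27)),  # row 2
]

# One +1 event per start date and one -1 event per end date.
events = []
for row, (start, end) in enumerate(ranges):
    events.append((start, 1, row))
    events.append((end, -1, row))

# Sort by date; on equal dates, starts (+1) come before ends (-1).
events.sort(key=lambda ev: (ev[0], -ev[1]))

open_ranges = 0   # running count of ranges that are currently open
current = []      # rows opened since the count was last 0
groups = []
for day, delta, row in events:
    open_ranges += delta
    if delta == 1:
        current.append(row)
    if open_ranges == 0:
        # Every open range has closed, so the contiguous group ends here.
        groups.append(current)
        current = []

# Only groups with more than one row contain an overlap.
overlapping = [g for g in groups if len(g) > 1]
print(groups)       # [[0, 1], [2]]
print(overlapping)  # [[0, 1]]
```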

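import numpy as np

def overlaping_groups(df):
    # Assumes the rows of df are already ordered by effectiveDate.
    n = len(df)
    cols = ['effectiveDate', 'discontinuedDate']
    # Interleave the dates: [start0, end0, start1, end1, ...]
    v = np.column_stack([df[c].values for c in cols]).ravel()
    # 1 marks a range opening, -1 marks a range closing
    i = np.tile([1, -1], n)
    # Sort by date; on ties, openings (1) come before closings (-1)
    a = np.lexsort([-i, v])
    # Inverse permutation, used to map the running count back to the original order
    u = np.empty_like(a)
    u[a] = np.arange(a.size)
    # Rows whose closing date brings the running count back to 0 end a group
    e = np.flatnonzero(i[a].cumsum()[u][1::2] == 0)
    d = np.diff(np.append(-1, e))               # number of rows in each group
    s = np.split(df.index.values, e[:-1] + 1)   # index labels of each group
    # Keep only the groups with more than one row
    keep = [g for j, g in enumerate(s) if d[j] > 1]
    # np.concatenate raises on an empty list, so return an empty frame in that case
    return df.loc[np.concatenate(keep)] if keep else df.iloc[:0]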

gcols = ['carrier', 'flightnumber', 'departureAirport', 'arrivalAirport']
df.groupby(gcols, group_keys=False).apply(overlaping_groups)

  carrier  flightnumber departureAirport arrivalAirport effectiveDate discontinuedDate
0      4U          9748              DUS            GVA    2017-05-09       2017-07-12
1      4U          9748              DUS            GVA    2017-05-14       2017-07-16
3      AG          1234              SFO            DFW    2017-03-09       2017-05-12
4      AG          1234              SFO            DFW    2017-03-14       2017-05-16
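
As a sanity check on small data, the same rows can be found with a brute-force self-merge that compares every pair of rows within a flight. This is a different technique from the cumulative-sum approach above, it is quadratic in the number of rows per flight, and it assumes inclusive endpoints; the names indexed, pairs and overlapping_rows are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "carrier": ["4U", "4U", "4U", "AG", "AG"],
    "flightnumber": [9748, 9748, 9748, 1234, 1234],
    "departureAirport": ["DUS", "DUS", "DUS", "SFO", "SFO"],
    "arrivalAirport": ["GVA", "GVA", "GVA", "DFW", "DFW"],
    "effectiveDate": pd.to_datetime(
        ["2017-05-09", "2017-05-14", "2017-07-18", "2017-03-09", "2017-03-14"]),
    "discontinuedDate": pd.to_datetime(
        ["2017-07-12", "2017-07-16", "2017-08-27", "2017-05-12", "2017-05-16"]),
})
gcols = ["carrier", "flightnumber", "departureAirport", "arrivalAirport"]

# Keep the original row label in a column named "index" so it survives the merge.
indexed = df.reset_index()

# Pair every row with every row of the same flight.
pairs = indexed.merge(indexed, on=gcols, suffixes=("_a", "_b"))

# Drop the pairs that match a row with itself.
pairs = pairs[pairs["index_a"] != pairs["index_b"]]

# Two inclusive ranges overlap when each one starts before the other ends.
hit = (
    (pairs["effectiveDate_a"] <= pairs["discontinuedDate_b"])
    & (pairs["effectiveDate_b"] <= pairs["discontinuedDate_a"])
)

# Rows that overlap at least one other row of the same flight.
overlapping_rows = df.loc[sorted(pairs.loc[hit, "index_a"].unique())]
print(overlapping_rows.index.tolist())  # [0, 1, 3, 4]
```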
piRSquared answered Oct 16 '22 07:10