How to find rows with overlapping date ranges?

Tags:

pandas

I have a dataframe that contains data like the sample shown under UPDATE below (a tiny subset of the data).

I'm trying to create a new dataframe that contains all rows that have the same values for carrier, flightnumber, departureAirport and arrivalAirport and that also have date ranges that overlap.

By overlap I mean the effectiveDate for one row falls between the effectiveDate and discontinuedDate for another record that has the same values for the other columns I mentioned.

So in my above example, the first two rows would be considered an example of this (and should both be included in the new dataframe), but the third row is not.

I'm assuming I want to use groupby, but I'm not entirely clear on what aggregation function I would apply. Below is what I have so far:

df.groupby(['carrier','flightnumber','departureAirport','arrivalAirport'])[['effectiveDate', 'discontinuedDate']].min()

but obviously I need to apply a function that determines overlap instead of min(). How would I go about identifying overlap instead of returning the minimum values for this group?

UPDATE:

carrier flightnumber  departureAirport  arrivalAirport  effectiveDate discontinuedDate
4U      9748          DUS               GVA             2017-05-09    2017-07-12
4U      9748          DUS               GVA             2017-05-14    2017-07-16
4U      9748          DUS               GVA             2017-07-18    2017-08-27
AG      1234          SFO               DFW             2017-03-09    2017-05-12
AG      1234          SFO               DFW             2017-03-14    2017-05-16

UPDATE 2:

As far as output goes I'd like to have any rows that overlap and have the same values for carrier, flightnumber, departureAirport and arrivalAirport returned in a new dataframe. There does not need to be any additional data included for these rows. So for the above example data, a dataframe like the one below would be my desired output:

carrier flightnumber  departureAirport  arrivalAirport  effectiveDate discontinuedDate
4U      9748          DUS               GVA             2017-05-09    2017-07-12
4U      9748          DUS               GVA             2017-05-14    2017-07-16
AG      1234          SFO               DFW             2017-03-09    2017-05-12
AG      1234          SFO               DFW             2017-03-14    2017-05-16

Notice that only one record has been excluded (the third for 9748). This is because its date range does not overlap with other records for the same flight.


asked May 31 '17 15:05

Abe Miessler

1 Answer

High Level Concept

Sort all dates together, placing an effectiveDate before a discontinuedDate when the two fall on exactly the same date.
Before sorting, assign +1 to every effectiveDate and -1 to every discontinuedDate, then take the cumulative sum in sorted order. An overlap happens when the cumulative sum rises above 1. A contiguous group ends when the sum drops back to 0.
Undo the sort and identify where the zeros occur; these mark the ends of overlapping groups.
Split the dataframe index at these break points and keep only the splits that contain more than one row.
Concatenate the surviving splits and use loc to get the sliced dataframe.
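To make the running count concrete, here is a minimal sketch for the three 4U 9748 rows from the question. The names `dates`, `order` and `running` are illustrative only and do not appear in the answer's function; the dates are assumed to be parsed as datetime64.

```python
import numpy as np

# The three 4U 9748 rows from the question: (effectiveDate, discontinuedDate)
dates = np.array(
    [["2017-05-09", "2017-07-12"],
     ["2017-05-14", "2017-07-16"],
     ["2017-07-18", "2017-08-27"]],
    dtype="datetime64[D]",
)

v = dates.ravel()                  # start0, end0, start1, end1, ...
i = np.tile([1, -1], len(dates))   # +1 opens a range, -1 closes it
order = np.lexsort([-i, v])        # by date; on ties a start sorts before an end
running = i[order].cumsum()        # number of ranges open after each event

print(running.tolist())            # [1, 2, 1, 0, 1, 0]
```

The count reaches 2 while the first two ranges are both open, which is the overlap. It returns to 0 after 2017-07-16, closing the first block of two rows, and the third row forms a block of its own.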

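import numpy as np

def overlapping_groups(df):
    # Assumes datetime64 date columns and rows ordered by date within each group;
    # a range nested entirely inside an earlier row's range is not detected.
    n = len(df)
    cols = ['effectiveDate', 'discontinuedDate']
    # Interleave the dates: start0, end0, start1, end1, ...
    v = np.column_stack([df[c].values for c in cols]).ravel()
    i = np.tile([1, -1], n)        # +1 opens a range, -1 closes it
    a = np.lexsort([-i, v])        # sort by date; on ties a start sorts before an end
    u = np.empty_like(a)
    u[a] = np.arange(a.size)       # inverse permutation, used to undo the sort
    # Rows whose end date brings the running count of open ranges back to 0
    e = np.flatnonzero(i[a].cumsum()[u][1::2] == 0)
    d = np.diff(np.append(-1, e))  # number of rows in each contiguous block
    s = np.split(df.index.values, e[:-1] + 1)
    keep = [g for j, g in enumerate(s) if d[j] > 1]
    # np.concatenate raises on an empty list, so return an empty frame instead
    return df.loc[np.concatenate(keep)] if keep else df.iloc[:0]

gcols = ['carrier', 'flightnumber', 'departureAirport', 'arrivalAirport']
df.groupby(gcols, group_keys=False).apply(overlapping_groups)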

  carrier  flightnumber departureAirport arrivalAirport effectiveDate discontinuedDate
0      4U          9748              DUS            GVA    2017-05-09       2017-07-12
1      4U          9748              DUS            GVA    2017-05-14       2017-07-16
3      AG          1234              SFO            DFW    2017-03-09       2017-05-12
4      AG          1234              SFO            DFW    2017-03-14       2017-05-16
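For small frames, a brute-force cross-check can confirm the result. This is a sketch of a different technique (a self-join per flight), not the answer's method. It assumes the date columns are already datetime64, that the index is the default unnamed RangeIndex (so `reset_index()` produces a column called `index`), and that groups are small enough for a quadratic pairing. The names `pairs`, `hit` and `overlapping` are illustrative.

```python
import pandas as pd

# Sample data from the question's UPDATE
df = pd.DataFrame({
    "carrier": ["4U", "4U", "4U", "AG", "AG"],
    "flightnumber": [9748, 9748, 9748, 1234, 1234],
    "departureAirport": ["DUS", "DUS", "DUS", "SFO", "SFO"],
    "arrivalAirport": ["GVA", "GVA", "GVA", "DFW", "DFW"],
    "effectiveDate": pd.to_datetime(
        ["2017-05-09", "2017-05-14", "2017-07-18", "2017-03-09", "2017-03-14"]),
    "discontinuedDate": pd.to_datetime(
        ["2017-07-12", "2017-07-16", "2017-08-27", "2017-05-12", "2017-05-16"]),
})
gcols = ["carrier", "flightnumber", "departureAirport", "arrivalAirport"]

# Pair every row with every other row of the same flight.
left = df.reset_index()
pairs = left.merge(left, on=gcols, suffixes=("_a", "_b"))
pairs = pairs[pairs["index_a"] != pairs["index_b"]]

# Two closed ranges overlap when each one starts no later than the other ends.
hit = (
    (pairs["effectiveDate_a"] <= pairs["discontinuedDate_b"])
    & (pairs["effectiveDate_b"] <= pairs["discontinuedDate_a"])
)

overlapping = df.loc[sorted(pairs.loc[hit, "index_a"].unique())]
print(overlapping.index.tolist())  # [0, 1, 3, 4]
```

Because every pair is compared directly, this check does not depend on row order, at the cost of materialising all same-flight pairs.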

answered Oct 16 '22 07:10

piRSquared
