Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combining rows with overlapping time periods in a pandas dataframe

I am researching prescription habits and have large dataframes of sold products.

I am trying to transform purchases of medications into courses of the drugs by calculating how long the product would have lasted and adding a 5 day fudge factor for compliance, starting delays, etc to calculate an end date for the purchase.

I then want to combine prescriptions with overlapping date windows but I'm struggling to find an efficient way to do this. I was hoping a groupby would be possible but I can't figure out how to do this.

I know how to iterate over the dataframe to create a new dataframe with the relevant information, but it is a slow operation and I am hoping I can find a more elegant solution.

ID      start       end         ingredient  days    dose    end
1000    2018-10-03  2018-10-18  Metron...   10.0    125.00 
1000    2018-10-13  2018-10-25  Metron...   7.0     125.00 
1001    2018-03-08  2018-03-20  Cefalexin   7.0     150.00
1001    2018-09-17  2018-10-05  Cefalexin   13.0    150.00
1002    2018-05-18  2018-05-30  Amoxiclav   7.0     75.00
1002    2018-05-25  2018-06-06  Amoxiclav   7.0     100.00 
1003    2018-07-01  2018-07-16  Amoxiclav   10.0    50.00
1003    2018-07-15  2018-07-30  Amoxiclav   10.0    50.00 
1003    2018-07-25  2018-08-09  Amoxiclav   10.0    50.00 

My expected result is as follows:

ID      start       end         ingredient  days    dose
1000    2018-10-03  2018-10-25  Metron...   17.0    125.00
1001    2018-03-08  2018-03-20  Cefalexin   7.0     150.00
1001    2018-09-17  2018-10-05  Cefalexin   13.0    150.00
1002    2018-05-18  2018-05-30  Amoxiclav   7.0     75.00
1002    2018-05-25  2018-06-06  Amoxiclav   7.0     100.00 
1003    2018-07-01  2018-08-05  Amoxiclav   30.0    50.00

1000's second purchase was exactly 10 days in so the end date is the same as their second end date.

1001 did not overlap so remains as they are.

1002 overlaps on start and end dates but had a change in their dose so should not be combined.

1003 had 30 days worth in total. The start date of their final purchase is later than the end date of the first. Their end date should be 35 days after they first made a purchase. This is a negotiable criterion and an end date matching the final purchase's end date would be acceptable.

Am I barking up the wrong tree here? Must this be done iteratively?

like image 770
walkerbox Avatar asked Sep 05 '19 11:09

walkerbox


Video Answer


1 Answers

I think the biggest problem here is to identify when the time intervals are overlapping, the rest is just grouping and addition.

First, be sure (if not already done) to convert your dates to datetime and days in timedelta. This will help to compare dates and durations and perform some math on them.

df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df['days'] = pd.to_timedelta(df['days'], unit='D')

This code produces your expected result:

def join_times(x):
    startdf = pd.DataFrame({'time':x['start'], 'what':1})
    enddf = pd.DataFrame({'time':x['end'], 'what':-1})
    mergdf = pd.concat([startdf, enddf]).sort_values('time')
    mergdf['running'] = mergdf['what'].cumsum()
    mergdf['newwin'] = mergdf['running'].eq(1) & mergdf['what'].eq(1)
    mergdf['group'] = mergdf['newwin'].cumsum()
    x['group'] = mergdf['group'].loc[mergdf['what'].eq(1)]
    res = x.groupby('group').agg({'days':'sum', 'start':'first'})
    res['end'] = res.apply(lambda x : x['start'] + x['days'] + pd.to_timedelta(5, unit='D'), axis=1)
    return res

ddf = df.groupby(['ID', 'ingredient', 'dose']).apply(join_times).reset_index().drop('group', axis=1)

This needs to be explained. As you see, I use groupby to identify the subsamples. Then the job is done by the custom join_times function.

join_times function joins together in the same column of a single dataframe (column 'time') start and end times, sorted in order.
A second column 'what' marks with +1 starting times, and with -1 ending time. These are used to keep track of how many intervals are overlapping (in colum 'running' using cumsum()).
Then a boolean column 'newwin' is buildt to identify the beginning of a new non overlapping time interval and a column 'group' is buildt to mark with the same integer the rows belonging to the same overlapping time interval.

The a 'group' column is added to the original subsample, copying the values in the previously buildt 'group' column. Finally, we can identify, for each subsample, which rows have overlapping.
So we can use groupby again and sum the 'days' column, keeping the first date from 'start' column.
'end' column is calculated by adding to 'start' the duration 'days' plus 5 days.

The above code, using your data sample, gives:

     ID ingredient   dose    days      start        end
0  1000  Metron...  125.0 17 days 2018-10-03 2018-10-25
1  1001  Cefalexin  150.0  7 days 2018-03-08 2018-03-20
2  1001  Cefalexin  150.0 13 days 2018-09-17 2018-10-05
3  1002  Amoxiclav   75.0  7 days 2018-05-18 2018-05-30
4  1002  Amoxiclav  100.0  7 days 2018-05-25 2018-06-06
5  1003  Amoxiclav   50.0 30 days 2018-07-01 2018-08-05

Which is your expected result. Column order is different due to groupby operations with indexing.

like image 59
Valentino Avatar answered Sep 28 '22 04:09

Valentino