Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas get previous dataframe row by date

I'm working with some data where I have to get the date of occurrence. For example, say we're working with medical data. Each row is a unique visit from a patient, though the same patient can have multiple rows. Each row also contains info on the type of visit, whether it was routine or emergency room.

I need to go through, and for each row, get the date that the patient was previously admitted to the emergency room, prior to that visit. For example, I'd like to add a column previous_er_discharge_date as below:

visit_id    patient_id    discharge_date    visit_type    previous_er_discharge_date
1           abc           2014-05-05        in-patient    2014-05-01
2           abc           2014-05-01        emergency     NaT
3           def           2014-04-18        in-patient    NaT
4           def           2014-03-12        in-patient    2014-02-12
5           def           2014-02-12        emergency     NaT

So I have something working, but it's very slow. I basically just create a separate data frame of only ER visits, and iterate through the main data frame, and finding whether previous ER dates exist for that patient, and if they do I take the first one. (The data is sorted by discharge_date). A general representation of the code I have.

er_visits = main_data[main_data.visit_type=='emergency']
prev_dates = []
for index, row in main_data.iterrows():
    dates = er_visits.discharge_date[(er_visits.patient_id==row.patient_id) &
                                     (er_visits.discharge_date < row.discharge_date)].values

    if len(dates) > 0:
        prev_dates.append(dates[0])
    else:
        prev_dates.append(pd.NaT)

The above code works, but it's slow, and I was hoping to get help in finding faster ways to do this. The data set I'm working with has several hundred thousand rows, and this is something that has to run everyday.

Thanks!

like image 501
Binal Patel Avatar asked Dec 06 '22 21:12

Binal Patel


1 Answers

In pandas, you basically want to avoid loops, as they kill performance.

Her's a DataFrame similar to yours (I was lazy about the dates, so they're ints; it's the same idea).

df = pd.DataFrame({
    'id': ['abc', 'abc', 'def', 'def', 'def'],
    'date': [505, 501, 418, 312, 212]})

And here's a function that, for each group, appends the previous date:

def prev_dates(g):
    g.sort(columns=['date'])
    g['prev'] = g.date.shift(-1)
    return g

So all that's needed is to connect things:

 >> df.groupby(df.id).apply(prev_dates)
    date    id  prev
 0  505     abc     501
 1  501     abc     NaN
 2  418     def     312
 3  312     def     212
 4  212     def     NaN

Edit

As noted by @julius below, sort(columns= has since been deprecated, and should be replaced by ``sort_values(by=''.

like image 83
Ami Tavory Avatar answered Dec 19 '22 07:12

Ami Tavory