I have a pandas DataFrame, grouped by subject, with multiple encounter dates per subject, organized as follows:
row  Subject  encounter date  difference
0    1        1/1/2015        0
1    1        1/10/2015       9
2    1        1/09/2016       364
3    2        2/8/2015        0
4    2        4/20/2015       71
5    2        3/19/2016       334
6    2        3/22/2016       3
7    2        3/20/2017       363
Desired output:
row  Subject  encounter date  difference
0    1        1/1/2015        0
2    1        1/09/2016       373
3    2        2/8/2015        0
5    2        3/19/2016       405
7    2        3/20/2017       366
I would like to iterate over all rows, grouped by subject, and remove rows where the time difference relative to the previous row is < 365 days, revising the difference for the following rows as rows are removed. My current code will drop row 2 of the dataset, but I would like the time difference to be recalculated after rows are dropped. In this case, once row 1 is dropped, row 2 is compared against the subject's first encounter (row 0), and that difference is > 365 days, so row 2 should be kept.
Here is my current code. Any help will be appreciated:
import numpy as np
import pandas as pd

# Whole days since the same subject's previous encounter (0 for a subject's first row)
days = (df.groupby('Subject')['Encounter_Date'].diff().fillna(pd.Timedelta(0)) / np.timedelta64(1, 'D')).astype(int)
df = df.drop(df[(days > 0) & (days < 365)].index)
def drop_rows(date, subject):
    # Assumes rows are sorted by subject, then date, with a default 0..n-1 index
    current_subject = subject[0]
    # Kept rows, keyed by row label; the first row always starts at 0 days
    date_diff = {0: {'subj': current_subject, 'diff': pd.Timedelta('0 Days')}}
    j = 1
    for i in range(1, len(date)):
        if subject[i] == current_subject:
            diff = date[i] - date[i-j]
            if diff < pd.Timedelta('365 Days'):
                j += 1  # row dropped: compare the next row one step further back
                continue
        else:
            diff = pd.Timedelta('0 Days')  # first row of a new subject
            current_subject = subject[i]
        j = 1
        date_diff[i] = {'subj': current_subject, 'diff': diff}  # changed to dict
    return pd.DataFrame.from_dict(date_diff, orient='index')
Here's a bit of a hack, but it seems to work. I extended your code to handle grouping by subject, changing it in three places (noted below).
def drop_rows(date, subject):
    current_subject = subject[0]  # changed here
    date_diff = date - date  # timedelta=0, same shape as date
    j = 1  # offset back to the last kept row
    for i in range(1, len(date)):
        date_diff[i] = date[i] - date[i-j]
        if subject[i] == current_subject:
            if date_diff[i] < pd.Timedelta('365 Days'):
                date_diff.drop(i, inplace=True)
                j += 1
            else:
                j = 1
        else:
            date_diff[i] = pd.Timedelta('0 Days')  # changed here
            current_subject = subject[i]  # changed here
            j = 1  # reset so the next row is not compared across subjects
    return date_diff
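Note, of course, that the data must already be sorted by subject and date, that date is assumed to be of dtype datetime, and that the index is assumed to be the default 0..n-1, since the function looks rows up as date[i] and subject[i].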
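If the frame does not already meet those conditions, something along these lines could get it there. This is only a sketch: the column names Subject and date follow the call shown in this answer, the month/day/year format is assumed from the sample data in the question, and prepare is a made-up helper name.

```python
import pandas as pd

def prepare(df):
    # Hypothetical helper: returns a copy that meets drop_rows' assumptions.
    out = df.copy()
    # Parse month/day/year strings such as '1/09/2016' into datetime64 values.
    out['date'] = pd.to_datetime(out['date'], format='%m/%d/%Y')
    # Sort by subject, then date, and renumber the rows 0..n-1 so that
    # integer lookups like date[i] refer to consecutive rows.
    return out.sort_values(['Subject', 'date']).reset_index(drop=True)
```

Keep in mind that reset_index renumbers the rows, so the labels in the result will only match the original row numbers if the data was already in sorted order.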
>>> drop_rows(df.date, df.Subject)
0     0 days
2   373 days
3     0 days
5   405 days
7   366 days
Name: date, dtype: timedelta64[ns]
To get a new dataframe with only the selected rows, you could do the following:
df['new'] = drop_rows(df.date,df.Subject)
df = df[ df['new'].notnull() ]
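For comparison, the same rule (keep an encounter only if it falls at least 365 days after the last kept encounter for that subject) could also be written by tracking the last kept date per subject instead of using an offset counter. A minimal sketch, assuming the sample data from the question and the column names Subject and date; keep_spaced and min_gap are names made up for illustration:

```python
import pandas as pd

def keep_spaced(df, min_gap=pd.Timedelta(days=365)):
    # Hypothetical helper; assumes df['date'] is already datetime64.
    df = df.sort_values(['Subject', 'date'])
    kept_labels, gaps = [], []
    for _, group in df.groupby('Subject'):
        last_kept = None  # date of the most recent kept row for this subject
        for label, when in group['date'].items():
            if last_kept is None:
                gap = pd.Timedelta(0)  # first encounter is always kept
            else:
                gap = when - last_kept
                if gap < min_gap:
                    continue  # too close to the last kept encounter: skip it
            kept_labels.append(label)
            gaps.append(gap)
            last_kept = when
    out = df.loc[kept_labels].copy()
    out['difference'] = gaps
    return out

# Sample data copied from the question.
df = pd.DataFrame({
    'Subject': [1, 1, 1, 2, 2, 2, 2, 2],
    'date': pd.to_datetime(
        ['1/1/2015', '1/10/2015', '1/09/2016', '2/8/2015',
         '4/20/2015', '3/19/2016', '3/22/2016', '3/20/2017'],
        format='%m/%d/%Y'),
})
result = keep_spaced(df)
print(result)  # keeps rows 0, 2, 3, 5, 7 with gaps of 0, 373, 0, 405 and 366 days
```

Because each subject is handled in its own group, nothing carries over from one subject to the next, and the function does not depend on the index being 0..n-1.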