
drop values on datetime difference, with revision

Tags: python, pandas

I have a pandas DataFrame grouped by subject, with multiple encounter dates per subject, organized as follows:

row    Subject    encounter date    difference
0      1          1/1/2015          0
1      1          1/10/2015         9
2      1          1/09/2016         364
3      2          2/8/2015          0
4      2          4/20/2015         71
5      2          3/19/2016         334
6      2          3/22/2016         3
7      2          3/20/2017         363

Desired output:

row    Subject    encounter date    difference
0      1          1/1/2015          0
2      1          1/09/2016         373
3      2          2/8/2015          0
5      2          3/19/2016         405
7      2          3/20/2017         366

I would like to iterate over the rows of each subject and remove every row whose time difference from the previous kept row is < 365 days, recalculating the difference for the following rows each time a row is removed. My current code compares each row only with the row immediately before it, so it drops row 2 of the dataset as well as row 1. I would like to revise it so that the time difference is recalculated after rows are dropped: in this case, once row 1 is dropped, row 2 should be measured against row 0 (time 0), and that difference is > 365 days, so row 2 should stay.

Here is my current code. Any help will be appreciated:

import numpy as np
import pandas as pd

# whole days since the previous encounter of the same subject (0 for each subject's first row)
days = (df.groupby('Subject')['Encounter_Date'].diff() / np.timedelta64(1, 'D')).fillna(0).astype(int)
df = df.drop(df[(days > 0) & (days < 365)].index)
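To make the problem concrete, here is a small sketch that rebuilds the sample table and applies a single-pass filter equivalent to the one-liner above. The column names `Subject` and `Encounter_Date` come from that code; the variable names `gap` and `one_pass` are made up for illustration. Because every gap is measured against the immediately preceding row, including rows that are themselves about to be dropped, only the first encounter of each subject survives.

```python
import pandas as pd

df = pd.DataFrame({
    "Subject": [1, 1, 1, 2, 2, 2, 2, 2],
    "Encounter_Date": pd.to_datetime([
        "2015-01-01", "2015-01-10", "2016-01-09",
        "2015-02-08", "2015-04-20", "2016-03-19", "2016-03-22", "2017-03-20",
    ]),
})

# Days since the previous encounter of the same subject (NaN for the first one).
gap = df.groupby("Subject")["Encounter_Date"].diff().dt.days

# Single-pass filter: drop every row whose gap is strictly between 0 and 365 days.
one_pass = df[~((gap > 0) & (gap < 365))]

print(one_pass.index.tolist())
```

Rows 2, 5 and 7 are lost here even though each lies more than 365 days after the last row that would have been kept, which is why the gaps need to be recomputed as rows are removed.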


def drop_rows(date, subject):
    # assumes a default 0..n-1 index and rows sorted by subject, then date
    current_subject = subject[0]
    zero = pd.Timedelta('0 days')
    date_diff = {0: {'subj': current_subject, 'diff': zero}}   # changed to dict
    j = 1
    for i in range(1, len(date)):
        diff = date[i] - date[i-j]
        if subject[i] == current_subject:
            if diff < pd.Timedelta('365 days'):    # changed here
                j += 1                             # row i is dropped: no entry is stored
                continue
            j = 1
        else:
            diff = zero                            # changed here
            current_subject = subject[i]
            j = 1                                  # restart the look-back for the new subject
        date_diff[i] = {'subj': current_subject, 'diff': diff}
    return pd.DataFrame.from_dict(date_diff, orient='index')
AMS asked Jun 29 '26 12:06


1 Answer

Here's a bit of a hack, but it seems to work. I extended your code to handle grouping by subject, changing it in 3 places (noted below).

import pandas as pd

def drop_rows(date, subject):
    current_subject = subject[0] # changed here
    date_diff = date - date      # timedelta=0, same shape as date
    j = 1                        # distance back to the last kept row
    for i in range(1,len(date)):
        date_diff[i] = date[i] - date[i-j]
        if subject[i] == current_subject:
            if date_diff[i] < pd.Timedelta('365 days'):
                date_diff.drop(i,inplace=True)   # too close to the last kept row
                j += 1
            else:
                j = 1
        else:
            date_diff[i] = pd.Timedelta('0 days')    # changed here
            current_subject = subject[i]             # changed here
            j = 1    # restart the look-back so a new subject is never compared with the previous one
    return date_diff

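Note, of course, that the data must already be sorted by subject and date, and that date is assumed to be of dtype datetime. The function also looks rows up as date[i], so it assumes the default 0, 1, 2, ... index.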
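As a minimal sketch of that preparation, assuming the raw dates arrive as month/day/year strings as in the question: the small frame below is illustrative, and its column names `Subject` and `date` simply mirror the call shown next.

```python
import pandas as pd

df = pd.DataFrame({
    "Subject": [2, 1, 2, 1],
    "date": ["3/19/2016", "1/10/2015", "2/8/2015", "1/1/2015"],
})

# Parse the month/day/year strings into datetime64 values.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Sort by subject, then date, and rebuild a clean 0..n-1 index so that
# label look-ups such as date[i] and date[i - j] hit the intended rows.
df = df.sort_values(["Subject", "date"]).reset_index(drop=True)
```

Resetting the index matters after any sort or filter, since otherwise the old labels travel with the rows and `date[i]` no longer refers to the i-th row.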

>>> drop_rows(df.date,df.Subject)

0     0 days
2   373 days
3     0 days
5   405 days
7   366 days
Name: date, dtype: timedelta64[ns]

To get a new dataframe with only the selected rows, you could do the following:

df['new'] = drop_rows(df.date,df.Subject)
df = df[ df['new'].notnull() ]
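If relying on index positions feels fragile, a possible alternative (a sketch, not the approach above) is to walk each subject's dates and remember the last kept date directly. The function name `keep_spaced_encounters`, the `min_gap_days` parameter and the `difference` column are hypothetical names chosen for this example; the column names `Subject` and `date` follow the call above.

```python
import pandas as pd

def keep_spaced_encounters(df, min_gap_days=365):
    """Keep, per subject, only encounters at least min_gap_days after the last kept one."""
    min_gap = pd.Timedelta(days=min_gap_days)
    kept_labels, kept_diffs = [], []
    ordered = df.sort_values(["Subject", "date"])
    for _, dates in ordered.groupby("Subject")["date"]:
        last_kept = None
        for label, when in dates.items():
            if last_kept is None:
                diff = pd.Timedelta(0)      # first encounter of a subject is always kept
            else:
                diff = when - last_kept     # measured against the last KEPT row
                if diff < min_gap:
                    continue                # too close: drop and keep the same reference
            kept_labels.append(label)
            kept_diffs.append(diff)
            last_kept = when
    out = df.loc[kept_labels].copy()
    out["difference"] = [d.days for d in kept_diffs]
    return out

df = pd.DataFrame({
    "Subject": [1, 1, 1, 2, 2, 2, 2, 2],
    "date": pd.to_datetime([
        "2015-01-01", "2015-01-10", "2016-01-09",
        "2015-02-08", "2015-04-20", "2016-03-19", "2016-03-22", "2017-03-20",
    ]),
})
result = keep_spaced_encounters(df)
print(result)
```

On the sample data this keeps rows 0, 2, 3, 5 and 7 with differences of 0, 373, 0, 405 and 366 days, the same rows as `drop_rows`. It assumes the index labels are unique, but not that they run from 0 to n-1.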
JohnE answered Jul 01 '26 02:07