Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Drop duplicate rows from a pandas DataFrame whose timestamps are within a specified range or duration

I have a DataFrame like this:

Subject Verb    Object  Date
---------------------------------
Bill    Ate     Food    7/11/2015
Steve   Painted House   8/12/2011
Bill    Ate     Food    7/13/2015
Steve   Painted House   8/25/2011

I would like to drop all duplicates, where a duplicate is defined as having the same Subject, Verb, Object, and falls within an X day range (in my example: 5 days).

Subject Verb    Object  Date
---------------------------------
Bill    Ate     Food    7/11/2015
Steve   Painted House   8/12/2011
Steve   Painted House   8/25/2011

Neither instance of "Steve - Painted - House" is removed because they are outside of a 5 day window.

I know I can do this using some data structures and the iterrows method of the DataFrame, but is there a way to do this using Pandas drop_duplicates?

like image 681
Mike R Avatar asked Mar 07 '23 13:03

Mike R


1 Answers

Use duplicated + diff in conjunction with groupby to figure out what rows you want to remove.

c = ['Subject', 'Verb', 'Object']

def f(x):
    return x[c].duplicated() & x.Date.diff().dt.days.lt(5)

df = df.sort_values(c)
df[~df.groupby(c).apply(f).values]

  Subject     Verb Object       Date
0    Bill      Ate   Food 2015-07-11
1   Steve  Painted  House 2011-08-12
3   Steve  Painted  House 2011-08-25
like image 140
cs95 Avatar answered Apr 26 '23 12:04

cs95