I have a DataFrame like this:
Subject Verb Object Date
---------------------------------
Bill Ate Food 7/11/2015
Steve Painted House 8/12/2011
Bill Ate Food 7/13/2015
Steve Painted House 8/25/2011
I would like to drop all duplicates, where a duplicate is defined as having the same Subject, Verb, Object, and falls within an X day range (in my example: 5 days).
Subject Verb Object Date
---------------------------------
Bill Ate Food 7/11/2015
Steve Painted House 8/12/2011
Steve Painted House 8/25/2011
Neither instance of "Steve - Painted - House" is removed because they are outside of a 5 day window.
I know I can do this using some data structures and the iterrows method of the DataFrame, but is there a way to do this using Pandas drop_duplicates?
Use duplicated
+ diff
in conjunction with groupby
to figure out what rows you want to remove.
c = ['Subject', 'Verb', 'Object']
def f(x):
return x[c].duplicated() & x.Date.diff().dt.days.lt(5)
df = df.sort_values(c)
df[~df.groupby(c).apply(f).values]
Subject Verb Object Date
0 Bill Ate Food 2015-07-11
1 Steve Painted House 2011-08-12
3 Steve Painted House 2011-08-25
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With