Drop duplicate rows from a pandas DataFrame whose timestamps are within a specified range or duration

Question

I have a DataFrame like this:

Subject Verb    Object  Date
---------------------------------
Bill    Ate     Food    7/11/2015
Steve   Painted House   8/12/2011
Bill    Ate     Food    7/13/2015
Steve   Painted House   8/25/2011

I would like to drop all duplicates, where a duplicate is defined as having the same Subject, Verb, Object, and falls within an X day range (in my example: 5 days).

Subject Verb    Object  Date
---------------------------------
Bill    Ate     Food    7/11/2015
Steve   Painted House   8/12/2011
Steve   Painted House   8/25/2011

Neither instance of "Steve - Painted - House" is removed because they are outside of a 5 day window.

I know I can do this using some data structures and the iterrows method of the DataFrame, but is there a way to do this using Pandas drop_duplicates?

cs95 · Accepted Answer

Use duplicated + diff in conjunction with groupby to figure out what rows you want to remove.

c = ['Subject', 'Verb', 'Object']

def f(x):
    return x[c].duplicated() & x.Date.diff().dt.days.lt(5)

df = df.sort_values(c)
df[~df.groupby(c).apply(f).values]

  Subject     Verb Object       Date
0    Bill      Ate   Food 2015-07-11
1   Steve  Painted  House 2011-08-12
3   Steve  Painted  House 2011-08-25

Drop duplicate rows from a pandas DataFrame whose timestamps are within a specified range or duration

Tags:

python

pandas

dataframe

duplicates

Mike R

1 Answers

cs95

Recent Activity

Donate For Us

Drop duplicate rows from a pandas DataFrame whose timestamps are within a specified range or duration

Tags:

python

pandas

dataframe

duplicates

Mike R

1 Answers

cs95

Related questions

Recent Activity

Donate For Us