How do I replace duplicates for each group with NaNs while keeping the rows?
I need to keep every row (nothing should be dropped) and keep only the first occurrence of each value within a group; later repeats should become NaN.
import pandas as pd
from datetime import timedelta
df = pd.DataFrame({
'date': ['2019-01-01 00:00:00','2019-01-01 01:00:00','2019-01-01 02:00:00', '2019-01-01 03:00:00',
'2019-09-01 02:00:00','2019-09-01 03:00:00','2019-09-01 04:00:00', '2019-09-01 05:00:00'],
'value': [10,10,10,10,12,12,12,12],
'ID': ['Jackie','Jackie','Jackie','Jackie','Zoop','Zoop','Zoop','Zoop',]
})
df['date'] = pd.to_datetime(df['date'])
date value ID
0 2019-01-01 00:00:00 10 Jackie
1 2019-01-01 01:00:00 10 Jackie
2 2019-01-01 02:00:00 10 Jackie
3 2019-01-01 03:00:00 10 Jackie
4 2019-09-01 02:00:00 12 Zoop
5 2019-09-01 03:00:00 12 Zoop
6 2019-09-01 04:00:00 12 Zoop
7 2019-09-01 05:00:00 12 Zoop
Desired Dataframe:
date value ID
0 2019-01-01 00:00:00 10 Jackie
1 2019-01-01 01:00:00 NaN Jackie
2 2019-01-01 02:00:00 NaN Jackie
3 2019-01-01 03:00:00 NaN Jackie
4 2019-09-01 02:00:00 12 Zoop
5 2019-09-01 03:00:00 NaN Zoop
6 2019-09-01 04:00:00 NaN Zoop
7 2019-09-01 05:00:00 NaN Zoop
Edit:
Duplicated values should only be replaced within the same date, regardless of how often they occur. So if the value 10 shows up twice on Jan-1 and three times on Jan-2, it should be kept once on Jan-1 and once on Jan-2.
I assume you want to check for duplicates on the columns value and ID, and additionally on the calendar date extracted from the date column:
import numpy as np
df.loc[df.assign(d=df.date.dt.date).duplicated(['value','ID', 'd']), 'value'] = np.nan
Out[269]:
date value ID
0 2019-01-01 00:00:00 10.0 Jackie
1 2019-01-01 01:00:00 NaN Jackie
2 2019-01-01 02:00:00 NaN Jackie
3 2019-01-01 03:00:00 NaN Jackie
4 2019-09-01 02:00:00 12.0 Zoop
5 2019-09-01 03:00:00 NaN Zoop
6 2019-09-01 04:00:00 NaN Zoop
7 2019-09-01 05:00:00 NaN Zoop
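To check this against the case described in the question's edit, here is a minimal self-contained sketch. The sample frame and the names `day` and `dup` are made up for illustration, and it uses `Series.mask` in place of the `.loc` assignment above, which should behave the same way here.

```python
import pandas as pd

# Hypothetical data: value 10 appears twice on Jan-1 and three times on Jan-2.
df = pd.DataFrame({
    'date': pd.to_datetime([
        '2019-01-01 00:00:00', '2019-01-01 01:00:00',
        '2019-01-02 00:00:00', '2019-01-02 01:00:00', '2019-01-02 02:00:00',
    ]),
    'value': [10, 10, 10, 10, 10],
    'ID': ['Jackie'] * 5,
})

# Calendar day of each timestamp (time set to midnight), used only for the check.
day = df['date'].dt.normalize()

# True for every row that repeats an earlier (value, ID, day) combination.
dup = df.assign(day=day).duplicated(['value', 'ID', 'day'])

# Replace the repeats with NaN; the integer column is upcast to float64.
df['value'] = df['value'].mask(dup)
print(df)
```

The first row of each day keeps its value and the remaining rows of that day become NaN, so no row is removed.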
As @Trenton suggests, you may use pd.NA instead to avoid importing numpy.
(Note: as @rafaelc suggests, this link explains the differences between pd.NA and np.nan in detail:
https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)
df.loc[df.assign(d=df.date.dt.date).duplicated(['value','ID', 'd']), 'value'] = pd.NA
Out[273]:
date value ID
0 2019-01-01 00:00:00 10 Jackie
1 2019-01-01 01:00:00 <NA> Jackie
2 2019-01-01 02:00:00 <NA> Jackie
3 2019-01-01 03:00:00 <NA> Jackie
4 2019-09-01 02:00:00 12 Zoop
5 2019-09-01 03:00:00 <NA> Zoop
6 2019-09-01 04:00:00 <NA> Zoop
7 2019-09-01 05:00:00 <NA> Zoop
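If the aim is to keep the remaining values as integers next to the missing ones, one option is to convert the column to pandas' nullable `Int64` dtype before masking. This is a sketch of an alternative, not part of the answer above; the shortened sample frame and the name `dup` are illustrative.

```python
import pandas as pd

# Shortened version of the question's data (two rows per ID).
df = pd.DataFrame({
    'date': pd.to_datetime([
        '2019-01-01 00:00:00', '2019-01-01 01:00:00',
        '2019-09-01 02:00:00', '2019-09-01 03:00:00',
    ]),
    'value': [10, 10, 12, 12],
    'ID': ['Jackie', 'Jackie', 'Zoop', 'Zoop'],
})

# Same duplicate check as above: value, ID and calendar date.
dup = df.assign(d=df['date'].dt.date).duplicated(['value', 'ID', 'd'])

# Nullable integer dtype: masked entries become pd.NA, the rest stay integers.
df['value'] = df['value'].astype('Int64').mask(dup)
print(df.dtypes)
```

With an explicit dtype, the result does not depend on how pandas upcasts an `int64` column when a missing value is assigned into it.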