My current df looks like this:
IDnumber Subid Subsubid Date Originaldataindicator
a        1     x        2006 NaN
a        1     x        2007 NaN
a        1     x        2008 NaN
a        1     x        2008 1
The Originaldataindicator marks rows that existed in the original dataset; the other rows were created so that every IDnumber has all three years. What I want is to drop the duplicates and preferably keep the original data. Note that the row with the indicator will not always be the last observation, so to solve this I first sort on IDnumber, Date, Originaldataindicator.
However when I use:
df=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'])
Nothing happens and I still observe the duplicate.
df=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], inplace=True)
leaves me with nothing at all, because with inplace=True the method returns None rather than a dataframe.
Am I misinterpreting what drop_duplicates does?
Just to avoid confusion, this is what I want:
IDnumber Subid Subsubid Date Originaldataindicator
a        1     x        2006 NaN
a        1     x        2007 NaN
a        1     x        2008 1
The data includes thousands of these IDs.
I think you need groupby with sort_values and then drop_duplicates with the parameter keep='first':
print df
  IDnumber Subid Subsubid Date Originaldataindicator
0 a        1     x        2006 NaN
1 a        1     x        2007 NaN
2 a        1     x        2008 NaN
3 a        1     x        2008 1
4 a        1     x        2008 NaN
df = (df.groupby(['IDnumber', 'Subid', 'Subsubid', 'Date'])
        .apply(lambda x: x.sort_values('Originaldataindicator'))
        .reset_index(drop=True))
print df
  IDnumber Subid Subsubid Date Originaldataindicator
0 a        1     x        2006 NaN
1 a        1     x        2007 NaN
2 a        1     x        2008 1
3 a        1     x        2008 NaN
4 a        1     x        2008 NaN
df1=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], keep='first')
print df1
  IDnumber Subid Subsubid Date Originaldataindicator
0 a        1     x        2006 NaN
1 a        1     x        2007 NaN
2 a        1     x        2008 1
Or use inplace=True:
df.drop_duplicates(subset=['IDnumber','Subid','Subsubid','Date'], keep='first', inplace=True)
print df
  IDnumber Subid Subsubid Date Originaldataindicator
0 a        1     x        2006 NaN
1 a        1     x        2007 NaN
2 a        1     x        2008 1
If the column Originaldataindicator can contain multiple non-null values, use duplicated (perhaps over all the columns IDnumber, Subid, Subsubid, Date) together with isnull:
print df
  IDnumber Subid Subsubid Date Originaldataindicator
0 a        1     x        2006 NaN
1 a        1     x        2007 NaN
2 a        1     x        2008 NaN
3 a        1     x        2008 1
4 a        1     x        2008 1
print df[~(df.duplicated('Date', keep=False) & pd.isnull(df['Originaldataindicator']))]
  IDnumber Subid Subsubid Date Originaldataindicator
0 a        1     x        2006 NaN
1 a        1     x        2007 NaN
3 a        1     x        2008 1
4 a        1     x        2008 1
Explanation of the conditions:
print df.duplicated('Date', keep=False)
0 False
1 False
2 True
3 True
4 True
dtype: bool
print (pd.isnull(df['Originaldataindicator']))
0 True
1 True
2 True
3 False
4 False
Name: Originaldataindicator, dtype: bool
print ~((df.duplicated('Date', keep=False)) & (pd.isnull(df['Originaldataindicator'])))
0 True
1 True
2 False
3 True
4 True
dtype: bool
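Putting those pieces together, the mask-based filter can be reproduced end to end. A sketch using the five-row frame from this example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'IDnumber': ['a'] * 5,
    'Subid': [1] * 5,
    'Subsubid': ['x'] * 5,
    'Date': [2006, 2007, 2008, 2008, 2008],
    'Originaldataindicator': [np.nan, np.nan, np.nan, 1, 1],
})

# Drop only rows that are duplicated on Date AND carry no indicator;
# unique rows and rows with an indicator are kept.
mask = df.duplicated('Date', keep=False) & df['Originaldataindicator'].isnull()
result = df[~mask]
print(result)
```

Note that this variant keeps both indicator rows for 2008; it removes only the filler rows, it does not deduplicate among the originals.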
Consider this:
df = pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, None, 1, None]})
Then
>>> df.sort_values(by=['a', 'b']).groupby(df.a).first()[['b']].reset_index()
a b
0 1 1
1 2 2
2 3 1
This sorts the rows first by a, then by b (thus pushing the None values in each group last), and then selects the first item per group.
I believe you can modify this to the specifics of your problem.
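Adapted to the question's column names, that pattern might look like the sketch below (the key list and use of as_index=False are my assumptions; note that GroupBy.first also skips NaN on its own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'IDnumber': ['a'] * 4,
    'Subid': [1] * 4,
    'Subsubid': ['x'] * 4,
    'Date': [2006, 2007, 2008, 2008],
    'Originaldataindicator': [np.nan, np.nan, np.nan, 1],
})

keys = ['IDnumber', 'Subid', 'Subsubid', 'Date']
# Sort so non-null indicators come first within each group, then
# take one row per key combination; first() returns the first
# non-null value per column, so the original rows win.
result = (df.sort_values(keys + ['Originaldataindicator'])
            .groupby(keys, as_index=False)
            .first())
print(result)
```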