So I know you can use something like this to drop duplicate lines:

the_data.drop_duplicates(subset=['the_key'])

However, if the_key is null for some values, like below:

   the_key  C  D
1  NaN      *  *
2  NaN         *
3  111      *  *
4  111
It will keep only the rows marked in the C column. Is it possible to get drop_duplicates to treat every NaN as distinct, so the output keeps the rows marked in the D column?
Use duplicated chained with isna and filter by boolean indexing:

df = df[(~df['the_key'].duplicated()) | df['the_key'].isna()]
# for older pandas versions
# df = df[(~df['the_key'].duplicated()) | df['the_key'].isnull()]
print(df)
   the_key  C  D
1  NaN      *  *
2  NaN         *
3  111.0    *  *
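A minimal reproducible sketch of this answer, using a small DataFrame assumed to match the question's example (the C column values are an assumption):

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data
df = pd.DataFrame({'the_key': [np.nan, np.nan, 111, 111],
                   'C': ['*', '*', '*', None]})

# Keep the first occurrence of each key, but additionally keep
# every row whose key is NaN, so NaNs are treated as distinct
out = df[(~df['the_key'].duplicated()) | df['the_key'].isna()]
print(out)
```

Note that duplicated() on its own marks the second NaN as a duplicate of the first; the | df['the_key'].isna() clause is what rescues those rows.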
I'd do it this way:
# mark rows whose key has already appeared...
dupes = the_data.duplicated(subset=['the_key'])
# ...but never treat a NaN key as a duplicate
dupes[the_data['the_key'].isnull()] = False
the_data = the_data[~dupes]
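Run against the same assumed sample data, this mask-based variant keeps exactly the rows marked in the question's D column:

```python
import numpy as np
import pandas as pd

# Assumed sample data matching the question's example
the_data = pd.DataFrame({'the_key': [np.nan, np.nan, 111, 111],
                         'C': ['*', '*', '*', None]})

# Build a boolean mask of duplicated keys, then clear it for NaN keys
dupes = the_data.duplicated(subset=['the_key'])
dupes[the_data['the_key'].isnull()] = False
the_data = the_data[~dupes]
print(the_data)
```

Both answers produce the same result; this one just makes the "un-mark the NaNs" step explicit as a mutation of the mask.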