So I know you can use something like this to drop duplicate rows:
the_data.drop_duplicates(subset=['the_key'])
However, if the_key is null for some values, like below:
   the_key  C  D
1      NaN  *  *
2      NaN     *
3      111  *  *
4      111
It will keep only the rows marked in the C column. Is it possible to get drop_duplicates to treat all NaN values as distinct, so that the output keeps the rows marked in the D column?
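For reference, here is a minimal sketch reconstructing the frame above (the values and index are assumptions read off the table):

import numpy as np
import pandas as pd

# hypothetical reconstruction of the example frame
the_data = pd.DataFrame({'the_key': [np.nan, np.nan, 111, 111]}, index=[1, 2, 3, 4])

print(the_data.drop_duplicates(subset=['the_key']))
# the default behaviour treats the NaN keys as duplicates of each other,
# so only rows 1 and 3 survive (the C column above)
#    the_key
# 1      NaN
# 3    111.0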
Use duplicated chained with isna and filter by boolean indexing:
df = df[(~df['the_key'].duplicated()) | df['the_key'].isna()]
# for older pandas versions:
# df = df[(~df['the_key'].duplicated()) | df['the_key'].isnull()]
print(df)
   the_key  C    D
1      NaN  *    *
2      NaN       * 
3    111.0  *    *
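To see why this works, here is a sketch of the two intermediate masks on the sample frame (the frame construction is assumed from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'the_key': [np.nan, np.nan, 111, 111]}, index=[1, 2, 3, 4])

# duplicated() marks the second NaN as a duplicate of the first...
print(df['the_key'].duplicated().tolist())   # [False, True, False, True]
# ...so the isna() term re-admits every NaN row
print(df['the_key'].isna().tolist())         # [True, True, False, False]
# the combined mask keeps rows 1, 2 and 3
print(((~df['the_key'].duplicated()) | df['the_key'].isna()).tolist())  # [True, True, True, False]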
I'd do it this way:
# flag duplicated keys, then clear the flag for NaN keys so they are never dropped
dupes = the_data.duplicated(subset=['the_key'])
dupes[the_data['the_key'].isnull()] = False
the_data = the_data[~dupes]
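Applied to the sample frame from the question (reconstruction assumed, as above), this gives the same result as the first answer:

import numpy as np
import pandas as pd

the_data = pd.DataFrame({'the_key': [np.nan, np.nan, 111, 111]}, index=[1, 2, 3, 4])

dupes = the_data.duplicated(subset=['the_key'])   # [False, True, False, True]
dupes[the_data['the_key'].isnull()] = False       # [False, False, False, True]
print(the_data[~dupes])
#    the_key
# 1      NaN
# 2      NaN
# 3    111.0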