
Drop duplicates, but ignore nulls

Tags:

python

pandas

So I know you can use something like this to drop duplicate lines:

the_data.drop_duplicates(subset=['the_key'])

However, if the_key is null for some values, like below:

   the_key  C  D
1      NaN  *  *
2      NaN     *
3      111  *  *
4      111

It will keep only the rows marked in the C column. Is it possible to get drop_duplicates to treat every NaN as distinct, so the output keeps the rows marked in the D column?

asked May 03 '18 by ifly6

2 Answers

Chain duplicated with isna and filter by boolean indexing:

df = df[(~df['the_key'].duplicated()) | df['the_key'].isna()]
#for older pandas versions
#df = df[(~df['the_key'].duplicated()) | df['the_key'].isnull()]
print (df)
   the_key  C    D
1      NaN  *    *
2      NaN       * 
3    111.0  *    *
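
A self-contained sketch of this approach; the column values are illustrative stand-ins for the question's `*` marks:

```python
import numpy as np
import pandas as pd

# Reconstruct a frame like the one in the question
df = pd.DataFrame({
    'the_key': [np.nan, np.nan, 111, 111],
    'val': ['a', 'b', 'c', 'd'],
}, index=[1, 2, 3, 4])

# Keep a row if it is the first occurrence of its key OR its key is NaN.
# duplicated() treats repeated NaNs as duplicates, so the isna() term
# rescues every null-keyed row from being dropped.
mask = (~df['the_key'].duplicated()) | df['the_key'].isna()
result = df[mask]
print(result)  # rows 1, 2, 3 survive; row 4 (second 111) is dropped
```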
answered Sep 28 '22 by jezrael

I'd do it this way:

dupes = the_data.duplicated(subset=['the_key'])
dupes[the_data['the_key'].isnull()] = False
the_data = the_data[~dupes]
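
The same idea, wrapped into a reusable helper (the function name and sample data are illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

def drop_dupes_keep_nulls(df, key):
    """Drop duplicate rows by `key`, but never drop rows whose key is null."""
    dupes = df.duplicated(subset=[key])
    dupes[df[key].isnull()] = False  # un-mark null keys so they all survive
    return df[~dupes]

the_data = pd.DataFrame({'the_key': [np.nan, np.nan, 111, 111]})
print(drop_dupes_keep_nulls(the_data, 'the_key'))
```

Mutating the boolean mask in place makes the intent explicit: duplicates are computed first, then null-keyed rows are exempted.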
answered Sep 28 '22 by John Zwinck