So I know you can use something like this to drop duplicate lines:

the_data.drop_duplicates(subset=['the_key'])

However, if the_key is null for some values, like below:

   the_key  C  D
1  NaN      *  *
2  NaN         *
3  111      *  *
4  111
It will keep only the rows marked in the C column. Is it possible to get drop_duplicates to treat every NaN as distinct, so the output keeps the rows marked in the D column?
Use duplicated chained with isna and filter by boolean indexing:

df = df[(~df['the_key'].duplicated()) | df['the_key'].isna()]
# for older pandas versions
# df = df[(~df['the_key'].duplicated()) | df['the_key'].isnull()]
print(df)
   the_key  C  D
1  NaN      *  *
2  NaN         *
3  111.0    *  *
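A minimal reproducible sketch of this answer, using a small DataFrame assumed to match the question's example (the C column values are an assumption):

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data
df = pd.DataFrame({'the_key': [np.nan, np.nan, 111, 111],
                   'C': ['*', '*', '*', None]})

# Keep the first occurrence of each key, but additionally keep
# every row whose key is NaN, so NaNs are treated as distinct
out = df[(~df['the_key'].duplicated()) | df['the_key'].isna()]
print(out)
```

Note that duplicated() on its own marks the second NaN as a duplicate of the first; the | df['the_key'].isna() clause is what rescues those rows.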
I'd do it this way:
# mark rows whose key has already appeared...
dupes = the_data.duplicated(subset=['the_key'])
# ...but never treat a NaN key as a duplicate
dupes[the_data['the_key'].isnull()] = False
the_data = the_data[~dupes]
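Run against the same assumed sample data, this mask-based variant keeps exactly the rows marked in the question's D column:

```python
import numpy as np
import pandas as pd

# Assumed sample data matching the question's example
the_data = pd.DataFrame({'the_key': [np.nan, np.nan, 111, 111],
                         'C': ['*', '*', '*', None]})

# Build a boolean mask of duplicated keys, then clear it for NaN keys
dupes = the_data.duplicated(subset=['the_key'])
dupes[the_data['the_key'].isnull()] = False
the_data = the_data[~dupes]
print(the_data)
```

Both answers produce the same result; this one just makes the "un-mark the NaNs" step explicit as a mutation of the mask.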