Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Drop Duplicates in a DataFrame Keeping the Row with the Least Nulls

Tags:

python

pandas

With this DataFrame:

d = {'A' : pd.Series(['AA', 'AA', 'AA', 'BB','CC'], 
           index=['a', 'b', 'c', 'd','e']),
     'B' : pd.Series([1., 2., 3.], index=['b', 'd','e']),
     'C' : pd.Series([4., 5., 6.], index=['b', 'd', '']),
     'D' : pd.Series([1., 2., 3.,4.], index=['a', 'c', 'd','e'])}

In[1]: pd.DataFrame(d)

Out[1]: 
     A    B    C    D
 a  AA  NaN  NaN  1.0
 b  AA  1.0  4.0  NaN
 c  AA  NaN  NaN  2.0
 d  BB  2.0  5.0  3.0
 e  CC  3.0  6.0  4.0

I would like to drop duplicates on df['A'] and keep the row with the fewest null values in the columns that are not being dropped on.

In[2]: pd.DataFrame(d).drop_duplicates(on='A', **magical_answer=True**)

Out[1]: 
     A    B    C    D
 b  AA  1.0  4.0  NaN
 d  BB  2.0  5.0  3.0
 e  CC  3.0  6.0  4.0

I can see a possible issue not enumerated in this example would occur if there are multiple rows with the fewest nulls, in that case it would be useful to have the keep : {‘first’, ‘last’} arg.

like image 388
it's-yer-boy-chet Avatar asked May 03 '17 20:05

it's-yer-boy-chet


People also ask

Does pandas drop duplicates keep first?

Only consider certain columns for identifying duplicates, by default use all of the columns. Determines which duplicates (if any) to keep. - first : Drop duplicates except for the first occurrence.

How do you drop duplicate rows in a DataFrame?

By using pandas. DataFrame. drop_duplicates() method you can remove duplicate rows from DataFrame. Using this method you can drop duplicate rows on selected multiple columns or all columns.

How do you drop duplicates in pandas with conditions?

Pandas drop_duplicates() Function Syntax keep: allowed values are {'first', 'last', False}, default 'first'. If 'first', duplicate rows except the first one is deleted. If 'last', duplicate rows except the last one is deleted. If False, all the duplicate rows are deleted.

How do I get rid of duplicate rows in pandas?

drop_duplicates() Pandas drop_duplicates() method helps in removing duplicates from the Pandas Dataframe In Python.


2 Answers

An alternative would be to count the number of items in each row, sort the DataFrame and keep the last item so that it has the highest count.

(df.assign(counts=df.count(axis=1))
   .sort_values(['A', 'counts'])
   .drop_duplicates('A', keep='last')
   .drop('counts', axis=1))
Out: 
    A    B    C    D
b  AA  1.0  4.0  NaN
d  BB  2.0  5.0  3.0
e  CC  3.0  6.0  4.0
like image 144
ayhan Avatar answered Oct 06 '22 23:10

ayhan


If you don't have duplicated index, you can do:

df.loc[df.notnull().sum(1).groupby(df.A).idxmax()]

#    A    B   C   D
#b  AA  1.0 4.0 NaN
#d  BB  2.0 5.0 3.0
#e  CC  3.0 6.0 4.0
like image 35
Psidom Avatar answered Oct 07 '22 00:10

Psidom