With this DataFrame:
import pandas as pd

d = {'A': pd.Series(['AA', 'AA', 'AA', 'BB', 'CC'],
                    index=['a', 'b', 'c', 'd', 'e']),
     'B': pd.Series([1., 2., 3.], index=['b', 'd', 'e']),
     'C': pd.Series([4., 5., 6.], index=['b', 'd', 'e']),
     'D': pd.Series([1., 2., 3., 4.], index=['a', 'c', 'd', 'e'])}
In[1]: pd.DataFrame(d)
Out[1]:
    A    B    C    D
a  AA  NaN  NaN  1.0
b  AA  1.0  4.0  NaN
c  AA  NaN  NaN  2.0
d  BB  2.0  5.0  3.0
e  CC  3.0  6.0  4.0
I would like to drop duplicates on df['A'] and, within each group of duplicates, keep the row with the fewest null values in the other columns (the ones not used to identify duplicates).
In[2]: pd.DataFrame(d).drop_duplicates(on='A', magical_answer=True)
Out[2]:
    A    B    C    D
b  AA  1.0  4.0  NaN
d  BB  2.0  5.0  3.0
e  CC  3.0  6.0  4.0
One possible issue this example does not cover: several rows may tie for the fewest nulls. In that case it would be useful to have the keep : {'first', 'last'} argument to break the tie.
For reference, pandas.DataFrame.drop_duplicates() removes duplicate rows from a DataFrame. Its subset argument restricts the duplicate check to certain columns; by default all columns are used. Its keep argument determines which duplicates (if any) to keep and accepts {'first', 'last', False}, default 'first': 'first' drops duplicates except for the first occurrence, 'last' drops duplicates except for the last occurrence, and False drops all duplicates.
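To make the three keep values concrete, here is a minimal sketch on a small made-up frame (the data and the variable names are illustrative only, not taken from the question):

```python
import pandas as pd

# Made-up frame: rows 0 and 1 are duplicates on column 'A'.
demo = pd.DataFrame({"A": ["AA", "AA", "BB"], "B": [1, 2, 3]})

keep_first = demo.drop_duplicates(subset="A")               # default keep='first'
keep_last = demo.drop_duplicates(subset="A", keep="last")   # keep the last of each group
keep_none = demo.drop_duplicates(subset="A", keep=False)    # drop every duplicated row

print(keep_first.index.tolist())
print(keep_last.index.tolist())
print(keep_none.index.tolist())
```

None of these options looks at null counts, which is why the question needs something extra.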
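An alternative, with df = pd.DataFrame(d), would be to count the non-null values in each row, sort the DataFrame by that count, and keep the last row of each group so that it is the one with the highest count.
# Count non-null values per row, sort so the fullest row of each group comes last, keep it.
(df.assign(counts=df.count(axis=1))
   .sort_values(['A', 'counts'])
   .drop_duplicates('A', keep='last')
   .drop('counts', axis=1))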
Out:
    A    B    C    D
b  AA  1.0  4.0  NaN
d  BB  2.0  5.0  3.0
e  CC  3.0  6.0  4.0
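If several rows in a group tie for the highest count, it may help to make the tie-break explicit, as the question suggests with keep. The following is only a sketch: drop_duplicates_fewest_nulls is a hypothetical helper name, not a pandas function, and the tie data is made up.

```python
import pandas as pd


def drop_duplicates_fewest_nulls(df, subset, keep="first"):
    """Hypothetical helper: per duplicate group, keep the row with the most non-null values."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    # Work with row positions so that duplicated index labels cause no trouble.
    order = pd.DataFrame({
        "pos": range(len(df)),
        "counts": df.notna().sum(axis=1).to_numpy(),
    })
    # Highest count first; ties ordered by position (ascending for 'first',
    # descending for 'last') so the preferred row leads its group.
    order = order.sort_values(["counts", "pos"], ascending=[False, keep == "first"])
    pos = order["pos"].to_numpy()
    ranked = df.iloc[pos]
    is_dup = ranked.duplicated(subset=subset, keep="first").to_numpy()
    # Return the survivors in their original row order.
    return df.iloc[sorted(pos[~is_dup])]


# Made-up data in which the two 'x' rows have the same number of non-null values.
tie = pd.DataFrame({"A": ["x", "x", "y"],
                    "B": [1.0, None, 3.0],
                    "C": [None, 2.0, 3.0]})
kept_first = drop_duplicates_fewest_nulls(tie, "A", keep="first")
kept_last = drop_duplicates_fewest_nulls(tie, "A", keep="last")
```

Here keep='first' retains the first of the tied 'x' rows and keep='last' the second; the 'y' row survives either way. Counting over all columns is enough, because rows in the same group share the same values in the subset columns.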
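If the index has no duplicate labels, you can also count the non-null values per row and select, for each value of A, the label of the row with the highest count: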
df.loc[df.notnull().sum(1).groupby(df.A).idxmax()]
#    A    B    C    D
#b  AA  1.0  4.0  NaN
#d  BB  2.0  5.0  3.0
#e  CC  3.0  6.0  4.0
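When the index does contain duplicate labels, one possible workaround is to run the same idxmax idea on row positions instead of labels. This is a sketch under that assumption, using a made-up frame with a repeated label 'x':

```python
import pandas as pd

# Made-up frame whose index repeats the label 'x'.
dup_idx = pd.DataFrame({"A": ["AA", "AA", "BB"],
                        "B": [None, 1.0, 2.0]},
                       index=["x", "x", "y"])

# Drop the labels so that idxmax returns row positions (0, 1, 2, ...).
counts = dup_idx.notna().sum(axis=1).reset_index(drop=True)
keys = dup_idx["A"].reset_index(drop=True)

# Position of the fullest row for each value of 'A' (first one wins on ties).
best_pos = counts.groupby(keys).idxmax()

result = dup_idx.iloc[best_pos.to_numpy()]
print(result)
```

The result is ordered by the values of A, because groupby sorts its keys by default.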