 

Consider duplicate index in drop_duplicates method of a pandas DataFrame

The drop_duplicates method of a Pandas DataFrame considers all columns (the default) or a subset of columns (optional) when removing duplicate rows, but it does not take the index into account.

I am looking for a clean one-line solution that considers the index and a subset or all columns in determining duplicate rows. For example, consider the DataFrame

df = pd.DataFrame(index=['a', 'b', 'b', 'c'], data={'A': [0, 0, 0, 0], 'B': [1, 0, 0, 0]})
   A  B
a  0  1
b  0  0
b  0  0
c  0  0

Default use of the drop_duplicates method gives

df.drop_duplicates()
   A  B
a  0  1
b  0  0

If the index is also considered in determining duplicate rows, the result should be

df.drop_duplicates(consider_index=True) # not a supported keyword argument
   A  B
a  0  1
b  0  0
c  0  0

Is there a single method that provides this functionality and is cleaner than my current approach:

df['index'] = df.index
df.drop_duplicates(inplace=True)
del df['index']
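For reference, the workaround above runs end-to-end as follows (a minimal, self-contained sketch; note it temporarily clobbers any existing column named 'index'):

```python
import pandas as pd

df = pd.DataFrame(index=['a', 'b', 'b', 'c'],
                  data={'A': [0, 0, 0, 0], 'B': [1, 0, 0, 0]})

# Copy the index into a temporary column so drop_duplicates
# compares it alongside the data columns.
df['index'] = df.index
df.drop_duplicates(inplace=True)
del df['index']

print(df)
#    A  B
# a  0  1
# b  0  0
# c  0  0
```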
asked Aug 30 '18 by Russell Burdt

1 Answer

Call reset_index followed by duplicated, then use the resulting mask to index the original:

df = df[~df.reset_index().duplicated().values]
print(df)
   A  B
a  0  1
b  0  0
c  0  0
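The same idea extends to a subset of columns. A hedged sketch: reset_index() turns an unnamed index into a column called 'index', and .values is needed so the boolean mask (indexed 0..n-1 after reset_index) aligns positionally with the original DataFrame rather than by label:

```python
import pandas as pd

df = pd.DataFrame(index=['a', 'b', 'b', 'c'],
                  data={'A': [0, 0, 0, 0], 'B': [1, 0, 0, 0]})

# Index plus all columns determine duplicates.
result = df[~df.reset_index().duplicated().values]

# Index plus only column 'A' determine duplicates
# ('index' is the column name reset_index gives an unnamed index).
subset_result = df[~df.reset_index().duplicated(subset=['index', 'A']).values]
```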
answered Sep 18 '22 by cs95