I have a pandas dataframe as follows:
A B C
1 2 x
1 2 y
3 4 z
3 5 x
I want to keep only one row from each group of rows that share the same values in specific columns. In the example above, those are columns A and B. In other words, if a combination of values in columns A and B occurs more than once in the dataframe, only one of those rows should remain (it does not matter which one).
FWIW: the maximum number of so-called duplicate rows (that is, rows where columns A and B are the same) is 2.
The result should look like this:
A B C
1 2 x
3 4 z
3 5 x
or
A B C
1 2 y
3 4 z
3 5 x
Use drop_duplicates with the subset parameter. To keep the last row of each duplicated group instead of the first, add keep='last':
df1 = df.drop_duplicates(subset=['A','B'])
# same as (keep='first' is the default):
# df1 = df.drop_duplicates(subset=['A','B'], keep='first')
print(df1)
   A  B  C
0  1  2  x
2  3  4  z
3  3  5  x
df2 = df.drop_duplicates(subset=['A','B'], keep='last')
print(df2)
   A  B  C
1  1  2  y
2  3  4  z
3  3  5  x
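For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the dataframe from the question by hand. The variable names first, last and first_reset are just illustrative, and the reset_index step is an optional extra that the question did not ask for.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": [1, 1, 3, 3],
    "B": [2, 2, 4, 5],
    "C": ["x", "y", "z", "x"],
})

# Keep the first row of each (A, B) combination (the default behaviour).
first = df.drop_duplicates(subset=["A", "B"])

# Keep the last row of each (A, B) combination instead.
last = df.drop_duplicates(subset=["A", "B"], keep="last")

# Optional: renumber the index from 0 after rows have been dropped.
first_reset = first.reset_index(drop=True)

print(first)
print(last)
print(first_reset)
```

Note that drop_duplicates returns a new dataframe and leaves df unchanged, and that the surviving rows keep their original index labels unless you reset them.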