I want to select the first row when there are multiple rows with repeated values in a column.
For example:
import pandas as pd
df = pd.DataFrame({'col1':['one', 'one', 'one', 'one', 'one', 'one', 'one', 'one'],
'col2':['ID=ABCD1234', 'ID=ABCD1234', 'ID=ABCD1234', 'ID=ABCD5678',
'ID=ABCD5678', 'ID=ABCD5678', 'ID=ABCD9102', 'ID=ABCD9102']})
The pandas dataframe looks like this:
print(df)
col1 col2
0 one ID=ABCD1234
1 one ID=ABCD1234
2 one ID=ABCD1234
3 one ID=ABCD5678
4 one ID=ABCD5678
5 one ID=ABCD5678
6 one ID=ABCD9102
7 one ID=ABCD9102
I want the row 0, row 3, and row 6 to be selected and output as a new dataframe.
Expected output:
col1 col2
0 one ID=ABCD1234
3 one ID=ABCD5678
6 one ID=ABCD9102
You can use:
df.drop_duplicates(subset = ['col2'], keep = 'first', inplace = True)
Simply group by the values of the row and use first()
to pick the first row:
df.groupby('col2').first()
You might decide to group by multiple columns too:
df.groupby(['col1', 'col2']).first()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With