If I have the following dataframe:

| id | timestamp           | code | id2 |
|----|---------------------|------|-----|
| 10 | 2017-07-12 13:37:00 | 206  | a1  |
| 10 | 2017-07-12 13:40:00 | 206  | a1  |
| 10 | 2017-07-12 13:55:00 | 206  | a1  |
| 10 | 2017-07-12 19:00:00 | 206  | a2  |
| 11 | 2017-07-12 13:37:00 | 206  | a1  |

...
I need to group by the `id` and `id2` columns and get the first occurrence of the `timestamp` value in each group, e.g. for id=10, id2=a1 the result is timestamp=2017-07-12 13:37:00.

I googled it and found some possible solutions, but I can't figure out how to apply them properly. It probably should be something like:

df.groupby(["id", "id2"])["timestamp"].apply(lambda x: ...)
I think you need GroupBy.first:

df.groupby(["id", "id2"])["timestamp"].first()

Or drop_duplicates:

df.drop_duplicates(subset=['id','id2'])
Both produce the same output:
df1 = df.groupby(["id", "id2"], as_index=False)["timestamp"].first()
print (df1)
id id2 timestamp
0 10 a1 2017-07-12 13:37:00
1 10 a2 2017-07-12 19:00:00
2 11 a1 2017-07-12 13:37:00
df1 = df.drop_duplicates(subset=['id','id2'])[['id','id2','timestamp']]
print (df1)
id id2 timestamp
0 10 a1 2017-07-12 13:37:00
1 10 a2 2017-07-12 19:00:00
2 11 a1 2017-07-12 13:37:00
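One caveat worth noting: `first()` and `drop_duplicates` both return the first row in the dataframe's *existing order*, not necessarily the earliest timestamp. If the rows may be unsorted, `min()` on the timestamp column guarantees the earliest value per group. A minimal runnable sketch, rebuilding the sample data from the question:

```python
import pandas as pd

# Reconstruct the example dataframe from the question
df = pd.DataFrame({
    "id": [10, 10, 10, 10, 11],
    "timestamp": pd.to_datetime([
        "2017-07-12 13:37:00",
        "2017-07-12 13:40:00",
        "2017-07-12 13:55:00",
        "2017-07-12 19:00:00",
        "2017-07-12 13:37:00",
    ]),
    "code": [206] * 5,
    "id2": ["a1", "a1", "a1", "a2", "a1"],
})

# min() returns the earliest timestamp per (id, id2) group
# regardless of row order; first() would depend on sorting.
earliest = df.groupby(["id", "id2"], as_index=False)["timestamp"].min()
print(earliest)
```

Here the data is already sorted by timestamp, so all three approaches agree; `min()` is just the safer default when that ordering isn't guaranteed.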