Pandas: get the first occurrence grouping by keys

Tags: python, pandas

I have the following DataFrame:

| id | timestamp           | code | id2
| 10 | 2017-07-12 13:37:00 | 206  | a1
| 10 | 2017-07-12 13:40:00 | 206  | a1
| 10 | 2017-07-12 13:55:00 | 206  | a1
| 10 | 2017-07-12 19:00:00 | 206  | a2
| 11 | 2017-07-12 13:37:00 | 206  | a1
...

I need to group by the id and id2 columns and get the first occurrence of the timestamp value in each group, e.g. for id=10, id2=a1 the result should be timestamp=2017-07-12 13:37:00.

I googled it and found some possible solutions, but I can't figure out how to apply them properly. It should probably be something like:

df.groupby(["id", "id2"])["timestamp"].apply(lambda x: ....)
Novitoll asked Jul 12 '17 12:07




1 Answer

I think you need GroupBy.first:

df.groupby(["id", "id2"])["timestamp"].first()

Or drop_duplicates:

df.drop_duplicates(subset=['id','id2'])
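To try the two one-liners above, here is a minimal self-contained sketch. It rebuilds only the five rows visible in the question (whatever the "..." stands for is not reproduced) and assumes the timestamp column holds real datetimes, which the question does not state. The names first_ts and first_rows are just illustrative.

```python
import pandas as pd

# Only the five rows shown in the question; the elided rows are left out.
# Assumption: timestamp is parsed to datetime64 rather than kept as strings.
df = pd.DataFrame({
    "id": [10, 10, 10, 10, 11],
    "timestamp": pd.to_datetime([
        "2017-07-12 13:37:00",
        "2017-07-12 13:40:00",
        "2017-07-12 13:55:00",
        "2017-07-12 19:00:00",
        "2017-07-12 13:37:00",
    ]),
    "code": [206, 206, 206, 206, 206],
    "id2": ["a1", "a1", "a1", "a2", "a1"],
})

# GroupBy.first: a Series of timestamps indexed by the (id, id2) pairs.
first_ts = df.groupby(["id", "id2"])["timestamp"].first()

# drop_duplicates: whole rows, keeping the first row seen for each (id, id2) pair.
first_rows = df.drop_duplicates(subset=["id", "id2"])

print(first_ts)
print(first_rows)
```

The groupby version returns only the selected column, keyed by the group labels, while drop_duplicates keeps every column and the original row labels.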

To return the same columns with both approaches:

df1 = df.groupby(["id", "id2"], as_index=False)["timestamp"].first()
print(df1)
   id id2            timestamp
0  10  a1  2017-07-12 13:37:00
1  10  a2  2017-07-12 19:00:00
2  11  a1  2017-07-12 13:37:00

# drop_duplicates keeps the original row labels, hence the index 0, 3, 4
df1 = df.drop_duplicates(subset=['id','id2'])[['id','id2','timestamp']]
print(df1)
   id id2            timestamp
0  10  a1  2017-07-12 13:37:00
3  10  a2  2017-07-12 19:00:00
4  11  a1  2017-07-12 13:37:00
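One caveat: both approaches take the first row in the current row order. That matches the earliest timestamp only if each group is already sorted by time, as the sample in the question appears to be. If that is not guaranteed, a possible sketch is to sort first or to ask for the minimum directly. The shuffled data below is hypothetical, made up for illustration, and again assumes a datetime column.

```python
import pandas as pd

# Hypothetical, deliberately unsorted data: for (10, a1) the earliest
# timestamp is in the second row, not the first.
df = pd.DataFrame({
    "id": [10, 10, 10, 11],
    "timestamp": pd.to_datetime([
        "2017-07-12 13:55:00",
        "2017-07-12 13:37:00",
        "2017-07-12 19:00:00",
        "2017-07-12 13:37:00",
    ]),
    "code": [206, 206, 206, 206],
    "id2": ["a1", "a1", "a2", "a1"],
})

keys = ["id", "id2"]

# Option 1: sort by time, then keep the first row of each (id, id2) pair.
earliest_rows = df.sort_values("timestamp").drop_duplicates(subset=keys)

# Option 2: take the minimum per group; row order no longer matters.
earliest_ts = df.groupby(keys, as_index=False)["timestamp"].min()

print(earliest_rows)
print(earliest_ts)
```

Option 1 keeps the other columns of the matching row; option 2 returns only the keys and the earliest timestamp.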
jezrael answered Sep 30 '22 16:09
