Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas Aggregate/Group by based on most recent date

I have a DataFrame as follows, where Id is a string and Date is a datetime:

Id    Date
1     3-1-2012
1     4-8-2013
2     1-17-2013
2     5-4-2013
2     10-30-2012
3     1-3-2013

I'd like to consolidate the table to just show one row for each Id which has the most recent date.
Any thoughts on how to do this?

like image 645
ChrisArmstrong Avatar asked Jun 10 '13 17:06

ChrisArmstrong


People also ask

What does Group_by do in pandas?

groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. sort : Sort group keys.

What does AGG function do in pandas?

Pandas DataFrame agg() Method The agg() method allows you to apply a function or a list of function names to be executed along one of the axis of the DataFrame, default 0, which is the index (row) axis. Note: the agg() method is an alias of the aggregate() method.

What is Groupby AGG?

Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame. apply. For a DataFrame, can pass a dict, if the keys are DataFrame column names.


2 Answers

You can groupby the Id field:

In [11]: df
Out[11]:
  Id                Date
0  1 2012-03-01 00:00:00
1  1 2013-04-08 00:00:00
2  2 2013-01-17 00:00:00
3  2 2013-05-04 00:00:00
4  2 2012-10-30 00:00:00
5  3 2013-01-03 00:00:00

In [12]: g = df.groupby('Id')

If you are not certain about the ordering, you could do something along the lines:

In [13]: g.agg(lambda x: x.iloc[x.Date.argmax()])
Out[13]:
                  Date
Id
1  2013-04-08 00:00:00
2  2013-05-04 00:00:00
3  2013-01-03 00:00:00

which for each group grabs the row with largest (latest) date (the argmax part).

If you knew they were in order you could take the last (or first) entry:

In [14]: g.last()
Out[14]:
                  Date
Id
1  2013-04-08 00:00:00
2  2012-10-30 00:00:00
3  2013-01-03 00:00:00

(Note: they're not in order, so this doesn't work in this case!)

like image 59
Andy Hayden Avatar answered Oct 28 '22 00:10

Andy Hayden


In the Hayden response, I think that using x.loc in place of x.iloc is better, as the index of the df dataframe could be sparse (and in this case the iloc will not work).

(I haven't enought points on stackoverflow to post it in comments of the response).

like image 37
head7 Avatar answered Oct 28 '22 02:10

head7