Pandas: resample timeseries with groupby

Given the below pandas DataFrame:

In [115]: times = pd.to_datetime(pd.Series(['2014-08-25 21:00:00', '2014-08-25 21:04:00',
                                            '2014-08-25 22:07:00', '2014-08-25 22:09:00']))
          locations = ['HK', 'LDN', 'LDN', 'LDN']
          event = ['foo', 'bar', 'baz', 'qux']
          df = pd.DataFrame({'Location': locations,
                             'Event': event}, index=times)
          df
Out[115]:
                    Event Location
2014-08-25 21:00:00   foo       HK
2014-08-25 21:04:00   bar      LDN
2014-08-25 22:07:00   baz      LDN
2014-08-25 22:09:00   qux      LDN

I would like to resample the data to aggregate it hourly by count while grouping by location, to produce a DataFrame that looks like this:

Out[115]:
                     HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2

I've tried various combinations of resample() and groupby() but with no luck. How would I go about this?

asked Aug 14 '15 by AshB



2 Answers

In my original post, I suggested using pd.TimeGrouper. TimeGrouper has since been deprecated in favor of pd.Grouper; the syntax is largely the same, so use pd.Grouper nowadays.

Moreover, while pd.TimeGrouper could only group by a DatetimeIndex, pd.Grouper can also group by datetime columns, which you specify through its key parameter.
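
For example, here is a minimal sketch of the key parameter in action, assuming the question's df but with the timestamps moved into a hypothetical column named 'time':

import pandas as pd

# Hypothetical: move the timestamps from the (unnamed) index into a regular 'time' column
df2 = df.reset_index().rename(columns={'index': 'time'})

# pd.Grouper can bin a datetime column (not just the index) via its key parameter
df2.groupby([pd.Grouper(key='time', freq='1H'), 'Location'])['Event'].count()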


You could use a pd.Grouper to group the DatetimeIndex'ed DataFrame by hour:

grouper = df.groupby([pd.Grouper(freq='1H'), 'Location']) 

use count to count the number of events in each group:

grouper['Event'].count()
#                      Location
# 2014-08-25 21:00:00  HK          1
#                      LDN         1
# 2014-08-25 22:00:00  LDN         2
# Name: Event, dtype: int64

use unstack to move the Location index level to a column level:

grouper['Event'].count().unstack()
# Out[49]:
# Location             HK  LDN
# 2014-08-25 21:00:00   1    1
# 2014-08-25 22:00:00 NaN    2

and then use fillna to change the NaNs into zeros.


Putting it all together,

grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
result = grouper['Event'].count().unstack('Location').fillna(0)

yields

Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
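
As a side note, depending on the pandas version, the counts may come back as floats after unstacking (because of the intermediate NaNs); if you want integer counts, you can cast at the end, e.g.:

result = grouper['Event'].count().unstack('Location').fillna(0).astype(int)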
answered Sep 19 '22 by unutbu


Pandas 0.21 answer: TimeGrouper is getting deprecated

There are two options for doing this, and they can actually give different results depending on your data (see the sketch at the end of this answer). The first option groups by Location and, within each Location group, resamples by hour. The second option groups by Location and hour at the same time.

Option 1: Use groupby + resample

grouped = df.groupby('Location').resample('H')['Event'].count() 

Option 2: Group both the location and DatetimeIndex together with groupby(pd.Grouper)

grouped = df.groupby(['Location', pd.Grouper(freq='H')])['Event'].count() 

Both will result in the following:

Location
HK        2014-08-25 21:00:00    1
LDN       2014-08-25 21:00:00    1
          2014-08-25 22:00:00    2
Name: Event, dtype: int64

And then reshape:

grouped.unstack('Location', fill_value=0) 

This will output:

Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
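
To make the possible difference between the two options concrete, here is a rough sketch (the extra 23:05 HK event is hypothetical, added only so that HK has an hour with no events): groupby + resample upsamples within each Location, so HK should get a 22:00 row with a count of 0, whereas the Grouper-based grouping generally only produces the (Location, hour) combinations that actually contain events.

import pandas as pd

# Hypothetical extra event so that HK has a gap at 22:00
extra = pd.DataFrame({'Location': ['HK'], 'Event': ['quux']},
                     index=pd.to_datetime(['2014-08-25 23:05:00']))
df_gap = pd.concat([df, extra])

# Option 1: resample runs within each Location group
opt1 = df_gap.groupby('Location').resample('H')['Event'].count()

# Option 2: group by Location and hour simultaneously, then compare the HK rows
opt2 = df_gap.groupby(['Location', pd.Grouper(freq='H')])['Event'].count()

print(opt1)
print(opt2)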
answered Sep 19 '22 by Ted Petrou