Given the pandas DataFrame below:
In [115]: times = pd.to_datetime(pd.Series(['2014-08-25 21:00:00',
                                            '2014-08-25 21:04:00',
                                            '2014-08-25 22:07:00',
                                            '2014-08-25 22:09:00']))
          locations = ['HK', 'LDN', 'LDN', 'LDN']
          event = ['foo', 'bar', 'baz', 'qux']
          df = pd.DataFrame({'Location': locations, 'Event': event}, index=times)
          df
Out[115]:
                    Event Location
2014-08-25 21:00:00   foo       HK
2014-08-25 21:04:00   bar      LDN
2014-08-25 22:07:00   baz      LDN
2014-08-25 22:09:00   qux      LDN
I would like to resample the data, aggregating hourly by count while grouping by location, to produce a DataFrame that looks like this:
Out[115]:
                     HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
I've tried various combinations of resample() and groupby() but with no luck. How would I go about this?
In my original post, I suggested using pd.TimeGrouper. Nowadays, use pd.Grouper instead of pd.TimeGrouper. The syntax is largely the same, but TimeGrouper is now deprecated in favor of pd.Grouper.

Moreover, while pd.TimeGrouper could only group by DatetimeIndex, pd.Grouper can also group by datetime columns, which you can specify through the key parameter.
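For instance, here is a minimal sketch of grouping by a datetime column rather than by the index (the DataFrame and its 'time' column are made up purely to illustrate the key parameter):

import pandas as pd

# Hypothetical frame whose timestamps live in an ordinary column,
# not in a DatetimeIndex.
df2 = pd.DataFrame({
    'time': pd.to_datetime(['2014-08-25 21:00:00', '2014-08-25 21:04:00',
                            '2014-08-25 22:07:00']),
    'Location': ['HK', 'LDN', 'LDN'],
    'Event': ['foo', 'bar', 'baz'],
})

# key= names the datetime column to bucket by.
# (Note: pandas 2.2+ prefers the lowercase alias 'h' over 'H'.)
df2.groupby([pd.Grouper(key='time', freq='1H'), 'Location'])['Event'].count()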
You could use a pd.Grouper to group the DatetimeIndex'ed DataFrame by hour:
grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
use count to count the number of events in each group:
grouper['Event'].count()
#                      Location
# 2014-08-25 21:00:00  HK          1
#                      LDN         1
# 2014-08-25 22:00:00  LDN         2
# Name: Event, dtype: int64
use unstack to move the Location index level to a column level:
grouper['Event'].count().unstack()
# Out[49]:
# Location             HK  LDN
# 2014-08-25 21:00:00   1    1
# 2014-08-25 22:00:00  NaN    2
and then use fillna to change the NaNs into zeros.
Putting it all together,
grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
result = grouper['Event'].count().unstack('Location').fillna(0)
yields
Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
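As an aside, unstack accepts a fill_value argument, so the unstack/fillna steps can be collapsed into one. This also keeps the counts as integers, whereas fillna(0) on a column that picked up NaNs leaves you with floats:

result = grouper['Event'].count().unstack('Location', fill_value=0)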
There are two options for doing this. They can actually give different results depending on your data; a sketch of how they diverge follows the final output below. The first option groups by Location and, within each Location, resamples by hour. The second option groups by Location and hour at the same time.
Option 1: Use groupby + resample
grouped = df.groupby('Location').resample('H')['Event'].count()
Option 2: Group both the location and DatetimeIndex together with groupby(pd.Grouper)
grouped = df.groupby(['Location', pd.Grouper(freq='H')])['Event'].count()
They both will result in the following:
Location
HK        2014-08-25 21:00:00    1
LDN       2014-08-25 21:00:00    1
          2014-08-25 22:00:00    2
Name: Event, dtype: int64
And then reshape:
grouped.unstack('Location', fill_value=0)
Will output
Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
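As for where the two options can diverge: Option 1 resamples each Location over its own time span, so an empty hour inside a group's range still appears with a count of 0, while Option 2 only emits the (Location, hour) combinations that actually contain rows. A minimal sketch with made-up data (the outputs in the comments are what I'd expect on recent pandas):

import pandas as pd

# Hypothetical data: LDN has events at 21:00 and 23:00 but nothing at 22:00.
gap_times = pd.to_datetime(['2014-08-25 21:00:00', '2014-08-25 23:00:00'])
gap_df = pd.DataFrame({'Location': ['LDN', 'LDN'], 'Event': ['a', 'b']},
                      index=gap_times)

# Option 1: resample fills the empty 22:00 bin within the group's span.
gap_df.groupby('Location').resample('H')['Event'].count()
# Location
# LDN       2014-08-25 21:00:00    1
#           2014-08-25 22:00:00    0
#           2014-08-25 23:00:00    1
# Name: Event, dtype: int64

# Option 2: pd.Grouper emits only non-empty groups, so 22:00 is absent.
gap_df.groupby(['Location', pd.Grouper(freq='H')])['Event'].count()
# Location
# LDN       2014-08-25 21:00:00    1
#           2014-08-25 23:00:00    1
# Name: Event, dtype: int64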