
Faster way to groupby time of day in pandas

I have a time series of several days of 1-minute data, and would like to average it across all days by time of day.

This is very slow:

from datetime import datetime
from numpy.random import randn
from pandas import date_range, Series

# Several days of 1-minute data
time_ind = date_range(datetime(2013, 1, 1), datetime(2013, 1, 10), freq='1min')
all_data = Series(randn(len(time_ind)), time_ind)
# Group by time of day and average across all days
time_mean = all_data.groupby(lambda x: x.time()).mean()

It takes almost a minute to run!

While something like:

time_mean = all_data.groupby(lambda x: x.minute).mean()

takes only a fraction of a second.

Is there a faster way to group by time of day?

Any idea why this is so slow?

asked Jun 25 '13 at 03:06 by joeb1415



1 Answer

Both your "lambda version" and the time property introduced in version 0.11 seem to be slow in version 0.11.0:

In [4]: %timeit all_data.groupby(all_data.index.time).mean()
1 loops, best of 3: 11.8 s per loop

In [5]: %timeit all_data.groupby(lambda x: x.time()).mean()
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <type 'exceptions.RuntimeError'> ignored
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <type 'exceptions.RuntimeError'> ignored
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <type 'exceptions.RuntimeError'> ignored
1 loops, best of 3: 11.8 s per loop

With the current master both methods are considerably faster:

In [1]: pd.version.version
Out[1]: '0.11.1.dev-06cd915'

In [5]: %timeit all_data.groupby(lambda x: x.time()).mean()
1 loops, best of 3: 215 ms per loop

In [6]: %timeit all_data.groupby(all_data.index.time).mean()
10 loops, best of 3: 113 ms per loop
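
To check how the two approaches compare on your own installation outside IPython, something like the following might work. It is only a rough sketch: it uses the standard `timeit` module instead of `%timeit`, reads the version from `pd.__version__`, and the helper names `by_lambda` and `by_time_attribute` are made up for illustration. The numbers it prints will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

print("pandas version:", pd.__version__)

# Same kind of data as in the question: 1-minute samples over several days.
time_ind = pd.date_range("2013-01-01", "2013-01-10", freq="1min")
all_data = pd.Series(np.random.randn(len(time_ind)), index=time_ind)


def by_lambda():
    # Hypothetical helper: calls x.time() once per index value.
    return all_data.groupby(lambda x: x.time()).mean()


def by_time_attribute():
    # Hypothetical helper: uses the index's time property as the grouping key.
    return all_data.groupby(all_data.index.time).mean()


# Run each variant three times and keep the best wall-clock time.
lambda_seconds = min(timeit.repeat(by_lambda, number=1, repeat=3))
attribute_seconds = min(timeit.repeat(by_time_attribute, number=1, repeat=3))

print("lambda version:     %.3f s" % lambda_seconds)
print("index.time version: %.3f s" % attribute_seconds)
```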

So you can either update to master or wait for 0.11.1, which should be released this month.
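
If upgrading is not an option, another approach worth trying (not one of the two methods timed above) is to group by an integer "minutes since midnight" key built from the index's `hour` and `minute` attributes, which avoids creating a `datetime.time` object for every row. This is a minimal sketch; the names `minute_of_day`, `mean_by_minute` and `mean_by_time` are only illustrative, and I have not timed it against 0.11.0.

```python
import numpy as np
import pandas as pd

# Same kind of data as in the question: 1-minute samples over several days.
time_ind = pd.date_range("2013-01-01", "2013-01-10", freq="1min")
all_data = pd.Series(np.random.randn(len(time_ind)), index=time_ind)

# Integer key in the range 0..1439: minutes elapsed since midnight.
minute_of_day = all_data.index.hour * 60 + all_data.index.minute
mean_by_minute = all_data.groupby(minute_of_day).mean()

# Reference result keyed by datetime.time objects, as in the answer.
mean_by_time = all_data.groupby(all_data.index.time).mean()

# Both results have one row per minute of the day, in the same order.
print(len(mean_by_minute), len(mean_by_time))
print(np.allclose(mean_by_minute.values, mean_by_time.values))
```

The result is indexed by integers rather than times; key `k` corresponds to the time `k // 60` hours and `k % 60` minutes after midnight.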

answered Sep 28 '22 at 07:09 by bmu