Pandas - grouping intra day timeseries by date

I have an intraday series of log returns spanning multiple days that I would like to downsample to daily OHLC. I can do something like

hi = series.resample('B', how=lambda x: np.max(np.cumsum(x)))
low = series.resample('B', how=lambda x: np.min(np.cumsum(x)))

But it seems inefficient to compute cumsum on each call. Is there a way to first compute the cumsums and then apply 'ohlc' to the data?

1999-08-09 12:30:00-04:00   -0.000486
1999-08-09 12:31:00-04:00   -0.000606
1999-08-09 12:32:00-04:00   -0.000120
1999-08-09 12:33:00-04:00   -0.000037
1999-08-09 12:34:00-04:00   -0.000337
1999-08-09 12:35:00-04:00    0.000100
1999-08-09 12:36:00-04:00    0.000219
1999-08-09 12:37:00-04:00    0.000285
1999-08-09 12:38:00-04:00   -0.000981
1999-08-09 12:39:00-04:00   -0.000487
1999-08-09 12:40:00-04:00    0.000476
1999-08-09 12:41:00-04:00    0.000362
1999-08-09 12:42:00-04:00   -0.000038
1999-08-09 12:43:00-04:00   -0.000310
1999-08-09 12:44:00-04:00   -0.000337
...
1999-09-28 06:45:00-04:00    0.000000
1999-09-28 06:46:00-04:00    0.000000
1999-09-28 06:47:00-04:00    0.000000
1999-09-28 06:48:00-04:00    0.000102
1999-09-28 06:49:00-04:00   -0.000068
1999-09-28 06:50:00-04:00    0.000136
1999-09-28 06:51:00-04:00    0.000566
1999-09-28 06:52:00-04:00    0.000469
1999-09-28 06:53:00-04:00    0.000000
1999-09-28 06:54:00-04:00    0.000000
1999-09-28 06:55:00-04:00    0.000000
1999-09-28 06:56:00-04:00    0.000000
1999-09-28 06:57:00-04:00    0.000000
1999-09-28 06:58:00-04:00    0.000000
1999-09-28 06:59:00-04:00    0.000000
asked Feb 01 '13 by signalseeker

2 Answers

df.groupby([df.index.year, df.index.month, df.index.day]).transform(np.cumsum).resample('B', how='ohlc')

I think this might be what I want, but I still have to test it.

EDIT: After Zelazny7's response:

df.groupby(pd.TimeGrouper('D')).transform(np.cumsum).resample('D', how='ohlc')

works and is also more efficient than my previous solution.

UPDATE:

pd.TimeGrouper('D') has been deprecated since pandas v0.21.0.

Use pd.Grouper() instead:

df.groupby(pd.Grouper(freq='D')).transform(np.cumsum).resample('D', how='ohlc')
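For reference, here is a minimal self-contained sketch of the same approach against current pandas, where the how= keyword has also been removed in favour of calling .ohlc() on the resampler. The sample data below is made up purely for illustration:

import numpy as np
import pandas as pd

# made-up minute-level log returns over two days
idx = pd.date_range('1999-08-09 12:30', periods=15, freq='min', tz='US/Eastern')
idx = idx.append(pd.date_range('1999-08-10 12:30', periods=15, freq='min', tz='US/Eastern'))
series = pd.Series(np.random.randn(len(idx)) * 1e-4, index=idx)

# cumulate the log returns within each day, then take the daily OHLC of that path
intraday = series.groupby(pd.Grouper(freq='D')).transform('cumsum')
daily = intraday.resample('B').ohlc()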
answered by signalseeker


I wasn't able to get your resample suggestion to work. Did you have any luck? Here's a way to aggregate the data at the business day level and compute the OHLC stats in one pass:

from io import StringIO
from pandas import *

text = """1999-08-09 12:30:00-04:00   -0.000486
1999-08-09 12:31:00-04:00   -0.000606
1999-08-09 12:32:00-04:00   -0.000120
1999-08-09 12:33:00-04:00   -0.000037
1999-08-09 12:34:00-04:00   -0.000337
1999-08-09 12:35:00-04:00    0.000100
1999-08-09 12:36:00-04:00    0.000219
1999-08-09 12:37:00-04:00    0.000285
1999-08-09 12:38:00-04:00   -0.000981
1999-08-09 12:39:00-04:00   -0.000487
1999-08-09 12:40:00-04:00    0.000476
1999-08-09 12:41:00-04:00    0.000362
1999-08-09 12:42:00-04:00   -0.000038
1999-08-09 12:43:00-04:00   -0.000310
1999-08-09 12:44:00-04:00   -0.000337"""

df = read_csv(StringIO(text), sep=r'\s+', parse_dates=[[0, 1]], index_col=[0], header=None)

Here I create a dictionary of dictionaries. The outer key references the column you want to apply the functions to, the inner keys are the names of the resulting aggregate columns, and the inner values are the functions you want to apply:

f = {2: {'O':'first',
         'H':'max',
         'L':'min',
         'C':'last'}}

df.groupby(TimeGrouper(freq='B')).agg(f)

Out:
                   2
                   H         C         L         O
1999-08-09  0.000476 -0.000337 -0.000981 -0.000486
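As a side note, the dict-of-dicts renaming style used above was removed in later pandas releases. Assuming the same frame (whose single value column is labelled 2 by the read_csv call above), a roughly equivalent one-pass aggregation using named aggregation would be:

import pandas as pd

# column 2 is the single value column produced by the read_csv call above
daily = df.groupby(pd.Grouper(freq='B'))[2].agg(
    O='first', H='max', L='min', C='last')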
answered by Zelazny7