I have multiple timeframes I want to track and predict values for (ARIMA forecasting 1 period ahead per timeframe), and my algorithm retrieves data on the lowest timeframe. Note: each timeframe is evenly (but not necessarily symmetrically) divisible by lower timeframes. As an important side note, intraday timeframes (1H, 4H) should start at 9:30am (market open time).
1min, 5min, 15min, 30min, 1H, 4H, 1D, 1W, 1M
First, the algorithm does a bulk history call and gets all 1-minute data within a given timespan (warm-up data). Then, during the course of its runtime (it never stops; it's a trading algorithm), it receives new data every minute.
The first step (after receiving the bulk historical data) is to aggregate it for each higher timeframe, building each timeframe from the next-lower timeframe's already-aggregated data so that every step works on a smaller dataset and saves processing time (see the sketch after the list):
i.e.,
1min timeframe data = 1min data (historical bulk data)
5min timeframe data = aggregated 1min timeframe data to 5min
15min timeframe data = aggregated 5min timeframe data to 15min
etc...
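To make the intent concrete, here is a rough, untested sketch of the cascading step (the aggregate helper is hypothetical; implementing it is essentially what Q1 below asks):
# Ordered from lowest to highest timeframe.
TIMEFRAMES = ["1min", "5min", "15min", "30min", "1H", "4H", "1D", "1W", "1M"]

def aggregate(df, freq):
    # Hypothetical helper: roll OHLC bars up to `freq` (see Q1 below).
    ...

def build_all_timeframes(minute_df):
    # Each higher timeframe is built from the next-lower timeframe's bars,
    # so every step operates on a progressively smaller dataset.
    data = {"1min": minute_df}
    for lower, higher in zip(TIMEFRAMES, TIMEFRAMES[1:]):
        data[higher] = aggregate(data[lower], higher)
    return data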
Given the following example dataset (and assume there could be other symbols included):
close high low open
symbol time
SPY 2018-05-22 09:31:00 270.245900 270.374166 270.226167 270.305100
2018-05-22 09:32:00 270.344566 270.374166 270.206434 270.245900
2018-05-22 09:33:00 270.374166 270.374166 270.314966 270.344566
2018-05-22 09:34:00 270.275500 270.374166 270.245900 270.374166
2018-05-22 09:35:00 270.413632 270.443232 270.275500 270.275500
2018-05-22 09:36:00 270.502431 270.522165 270.384032 270.413632
2018-05-22 09:37:00 270.541898 270.591231 270.492565 270.502431
Q1: How do I aggregate a multi-index pandas dataframe such that the time index and columns are aggregated per symbol? A partial, untested (and likely not working) solution illustrating my intent follows:
# Where timeframe.Frequency is a string such as "1H" or "1min"
df.resample(timeframe.Frequency).agg({"open": "first", "close": "last", "low": "min", "high": "max"})
A1: df.groupby(['symbol', pd.Grouper(freq=timeframe.Frequency, level='time')]).agg({'open':'first', 'close':'last', 'high':'max', 'low':'min'})
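For example, here is a minimal self-contained sketch of A1, rebuilding the sample rows from the question (values rounded) and using a literal "5min" in place of timeframe.Frequency:
import pandas as pd

# Rebuild the example (symbol, time) multi-index frame from the question.
times = pd.date_range("2018-05-22 09:31", periods=7, freq="1min")
idx = pd.MultiIndex.from_product([["SPY"], times], names=["symbol", "time"])
df = pd.DataFrame({
    "close": [270.2459, 270.3446, 270.3742, 270.2755, 270.4136, 270.5024, 270.5419],
    "high":  [270.3742, 270.3742, 270.3742, 270.3742, 270.4432, 270.5222, 270.5912],
    "low":   [270.2262, 270.2064, 270.3150, 270.2459, 270.2755, 270.3840, 270.4926],
    "open":  [270.3051, 270.2459, 270.3446, 270.3742, 270.2755, 270.4136, 270.5024],
}, index=idx)

# Group by symbol, then bucket the 'time' level into 5-minute bins.
bars_5min = (df.groupby(["symbol", pd.Grouper(freq="5min", level="time")])
               .agg({"open": "first", "close": "last",
                     "high": "max", "low": "min"}))
print(bars_5min)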
Q2: Given that there will be partially completed 'bars' for higher timeframes (e.g., at 9:37am the 5-minute timeframe will contain a full bar for 9:35am plus a partial bar represented as if it were the 9:40am bar), is it wise to keep partial data in a given timeframe's dataset, or should I use QuantConnect's consolidator to complete each bar before adding it to the dataset during runtime? And how do I account for this partial bar data when processing history?
A2: Partial bars are indeed created when resampling. For example, I may have only a single 1-minute data point after June 1st, and my monthly timeframe dataframe will then have a June 30th entry whose OHLC values are those of that single 1-minute bar. At this point I can either remove the row or just keep updating it with new values as data comes in.
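A rough sketch of the "remove the row" option, assuming bars are labelled by their bin start (as in the groupby/Grouper aggregation above); the helper name and the cutoff convention are my own:
import pandas as pd

def drop_partial_bars(bars, cutoff, freq):
    # Keep only bars whose full period [start, start + freq) ends on or before
    # `cutoff`, i.e. the time up to which 1-minute data is known to be complete.
    # Exactly where `cutoff` falls depends on whether your minute bars are
    # stamped at their open or at their close.
    starts = bars.index.get_level_values("time")
    ends = starts + pd.tseries.frequencies.to_offset(freq)
    return bars[ends <= cutoff]

# e.g. if 1-minute data is complete up to (but not including) 09:38, the
# partially built 09:35 five-minute bar is dropped:
# drop_partial_bars(bars_5min, pd.Timestamp("2018-05-22 09:38"), "5min")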
For your first question, assuming that the data is indexed by symbol and time:
(df.groupby(['symbol', df.index.get_level_values(1).floor('5T')])
   .agg({"open": "first",
         "close": "last",
         "low": "min",
         "high": "max"})
)
gives:
open close low high
symbol time
SPY 2018-05-22 09:30:00 270.305 270.276 270.206 270.374
SPY 2018-05-22 09:35:00 270.276 270.542 270.276 270.591
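If you also need the 1H and 4H bins to start at 9:30am (as mentioned in the question), the same floor trick can be shifted by the session open's offset within the bin; a rough, untested sketch (the helper name and hard-coded session open are illustrative):
import pandas as pd

def floor_to_session_bins(times, freq, session_open="09:30"):
    # Work out how far the session open sits past a "natural" bin boundary
    # (30min for 1H bins, 1h30min for 4H bins), shift by that amount, floor,
    # then shift back, so e.g. 1H bins fall on 09:30, 10:30, ...
    anchor = pd.Timestamp(f"2000-01-03 {session_open}")  # any date works for intraday freqs
    offset = anchor - anchor.floor(freq)
    return (times - offset).floor(freq) + offset

# Hourly bars anchored at 09:30, 10:30, ...
hourly = (df.groupby(["symbol",
                      floor_to_session_bins(df.index.get_level_values(1), "1H")])
            .agg({"open": "first", "close": "last",
                  "low": "min", "high": "max"}))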
The second question is too qualitative and depends very much on your data as well as your sampling windows.