Could someone please point me in the right direction with respect to OHLC data timeframe conversion with Pandas? What I'm trying to do is build a Dataframe with data for higher timeframes, given data with lower timeframe. For example, given I have the following one-minute (M1) data: <pre class="prettyprint"><code> Open High Low Close Volume Date 1999-01-04 10:22:00 1.1801 1.1819 1.1801 1.1817 4 1999-01-04 10:23:00 1.1817 1.1818 1.1804 1.1814 18 1999-01-04 10:24:00 1.1817 1.1817 1.1802 1.1806 12 1999-01-04 10:25:00 1.1807 1.1815 1.1795 1.1808 26 1999-01-04 10:26:00 1.1803 1.1806 1.1790 1.1806 4 1999-01-04 10:27:00 1.1801 1.1801 1.1779 1.1786 23 1999-01-04 10:28:00 1.1795 1.1801 1.1776 1.1788 28 1999-01-04 10:29:00 1.1793 1.1795 1.1782 1.1789 10 1999-01-04 10:31:00 1.1780 1.1792 1.1776 1.1792 12 1999-01-04 10:32:00 1.1788 1.1792 1.1788 1.1791 4 </code></pre> which has Open, High, Low, Close (OHLC) and volume values for every minute I would like to build a set of 5-minute readings (M5) which would look like so: <pre class="prettyprint"><code> Open High Low Close Volume Date 1999-01-04 10:25:00 1.1807 1.1815 1.1776 1.1789 91 1999-01-04 10:30:00 1.1780 1.1792 1.1776 1.1791 16 </code></pre> So the workflow is that: <ul> <li>Open is the Open of the first row in the timewindow</li> <li>High is the highest High in the timewindow</li> <li>Low is the lowest Low</li> <li>Close is the last Close</li> <li>Volume is simply a sum of Volumes</li> </ul> There are few issues though: <ul> <li>the data has gaps ( note there is no 10:30:00 row)</li> <li>the 5-minute intervals have to start at round time, e.g. M5 starts at 10:25:00 not 10:22:00</li> <li>first, incomplete set can be omitted like in this example, or included (so we could have 10:20:00 5-minute entry)</li> </ul> The Pandas documentation on up-down sampling gives an example, but they use mean value as the value of up-sampled row, which won't work here. I have tried using <code>groupby</code> and <code>agg</code> but to no avail. For one getting highest High and lowest Low might be not so hard, but I have no idea how to get first Open and last Close. What I tried is something along the lines of: <pre class="prettyprint"><code>grouped = slice.groupby( dr5minute.asof ).agg( { 'Low': lambda x : x.min()[ 'Low' ], 'High': lambda x : x.max()[ 'High' ] } ) </code></pre> but it results in following error, which I don't understand: <pre class="prettyprint"><code>In [27]: grouped = slice.groupby( dr5minute.asof ).agg( { 'Low' : lambda x : x.min()[ 'Low' ], 'High' : lambda x : x.max()[ 'High' ] } ) --------------------------------------------------------------------------- IndexError Traceback (most recent call last) /work/python/fxcruncher/<ipython-input-27-df50f9522a2f> in <module>() ----> 1 grouped = slice.groupby( dr5minute.asof ).agg( { 'Low' : lambda x : x.min()[ 'Low' ], 'High' : lambda x : x.max()[ 'High' ] } ) /usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in agg(self, func, *args, **kwargs) 242 See docstring for aggregate 243 """ --> 244 return self.aggregate(func, *args, **kwargs) 245 246 def _iterate_slices(self): /usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, arg, *args, **kwargs) 1153 colg = SeriesGroupBy(obj[col], column=col, 1154 grouper=self.grouper) -> 1155 result[col] = colg.aggregate(func) 1156 1157 result = DataFrame(result) /usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, func_or_funcs, *args, **kwargs) 906 return self._python_agg_general(func_or_funcs, *args, **kwargs) 907 except Exception: --> 908 result = self._aggregate_named(func_or_funcs, *args, **kwargs) 909 910 index = Index(sorted(result), name=self.grouper.names[0]) /usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in _aggregate_named(self, func, *args, **kwargs) 976 grp = self.get_group(name) 977 grp.name = name --> 978 output = func(grp, *args, **kwargs) 979 if isinstance(output, np.ndarray): 980 raise Exception('Must produce aggregated value') /work/python/fxcruncher/<ipython-input-27-df50f9522a2f> in <lambda>(x) ----> 1 grouped = slice.groupby( dr5minute.asof ).agg( { 'Low' : lambda x : x.min()[ 'Low' ], 'High' : lambda x : x.max()[ 'High' ] } ) IndexError: invalid index to scalar variable. </code></pre> So any help on doing that would be greatly appreciated. If the path I chose is not going to work, please suggest other relatively efficient approach (I have millions of rows). Some resources on using Pandas for financial processing would also be nice.

With a more recent version of Pandas, there is a <code>resample</code> method. It is very fast and is useful to accomplish the same task: <pre class="prettyprint"><code>ohlc_dict = { 'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last', 'Volume': 'sum', } df.resample('5T', closed='left', label='left').apply(ohlc_dict) </code></pre>

Converting OHLC stock data into a different timeframe with python and pandas

Tags:

python

pandas

stock

Could someone please point me in the right direction with respect to OHLC data timeframe conversion with Pandas? What I'm trying to do is build a Dataframe with data for higher timeframes, given data with lower timeframe.

For example, given I have the following one-minute (M1) data:

                       Open    High     Low   Close  Volume Date                                                        1999-01-04 10:22:00  1.1801  1.1819  1.1801  1.1817       4 1999-01-04 10:23:00  1.1817  1.1818  1.1804  1.1814      18 1999-01-04 10:24:00  1.1817  1.1817  1.1802  1.1806      12 1999-01-04 10:25:00  1.1807  1.1815  1.1795  1.1808      26 1999-01-04 10:26:00  1.1803  1.1806  1.1790  1.1806       4 1999-01-04 10:27:00  1.1801  1.1801  1.1779  1.1786      23 1999-01-04 10:28:00  1.1795  1.1801  1.1776  1.1788      28 1999-01-04 10:29:00  1.1793  1.1795  1.1782  1.1789      10 1999-01-04 10:31:00  1.1780  1.1792  1.1776  1.1792      12 1999-01-04 10:32:00  1.1788  1.1792  1.1788  1.1791       4

which has Open, High, Low, Close (OHLC) and volume values for every minute I would like to build a set of 5-minute readings (M5) which would look like so:

                       Open    High     Low   Close  Volume Date                                                        1999-01-04 10:25:00  1.1807  1.1815  1.1776  1.1789      91 1999-01-04 10:30:00  1.1780  1.1792  1.1776  1.1791      16

So the workflow is that:

Open is the Open of the first row in the timewindow
High is the highest High in the timewindow
Low is the lowest Low
Close is the last Close
Volume is simply a sum of Volumes

There are few issues though:

the data has gaps ( note there is no 10:30:00 row)
the 5-minute intervals have to start at round time, e.g. M5 starts at 10:25:00 not 10:22:00
first, incomplete set can be omitted like in this example, or included (so we could have 10:20:00 5-minute entry)

The Pandas documentation on up-down sampling gives an example, but they use mean value as the value of up-sampled row, which won't work here. I have tried using groupby and agg but to no avail. For one getting highest High and lowest Low might be not so hard, but I have no idea how to get first Open and last Close.

What I tried is something along the lines of:

grouped = slice.groupby( dr5minute.asof ).agg(      { 'Low': lambda x : x.min()[ 'Low' ], 'High': lambda x : x.max()[ 'High' ] }  )

but it results in following error, which I don't understand:

In [27]: grouped = slice.groupby( dr5minute.asof ).agg( { 'Low' : lambda x : x.min()[ 'Low' ], 'High' : lambda x : x.max()[ 'High' ] } ) --------------------------------------------------------------------------- IndexError                                Traceback (most recent call last) /work/python/fxcruncher/<ipython-input-27-df50f9522a2f> in <module>() ----> 1 grouped = slice.groupby( dr5minute.asof ).agg( { 'Low' : lambda x : x.min()[ 'Low' ], 'High' : lambda x : x.max()[ 'High' ] } )  /usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in agg(self, func, *args, **kwargs)     242         See docstring for aggregate     243         """ --> 244         return self.aggregate(func, *args, **kwargs)     245      246     def _iterate_slices(self):  /usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, arg, *args, **kwargs)    1153                     colg = SeriesGroupBy(obj[col], column=col,    1154                                          grouper=self.grouper) -> 1155                     result[col] = colg.aggregate(func)    1156     1157             result = DataFrame(result)  /usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, func_or_funcs, *args, **kwargs)     906                 return self._python_agg_general(func_or_funcs, *args, **kwargs)     907             except Exception: --> 908                 result = self._aggregate_named(func_or_funcs, *args, **kwargs)     909      910             index = Index(sorted(result), name=self.grouper.names[0])  /usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in _aggregate_named(self, func, *args, **kwargs)     976             grp = self.get_group(name)     977             grp.name = name --> 978             output = func(grp, *args, **kwargs)     979             if isinstance(output, np.ndarray):     980                 raise Exception('Must produce aggregated value')  /work/python/fxcruncher/<ipython-input-27-df50f9522a2f> in <lambda>(x) ----> 1 grouped = slice.groupby( dr5minute.asof ).agg( { 'Low' : lambda x : x.min()[ 'Low' ], 'High' : lambda x : x.max()[ 'High' ] } )  IndexError: invalid index to scalar variable.

So any help on doing that would be greatly appreciated. If the path I chose is not going to work, please suggest other relatively efficient approach (I have millions of rows). Some resources on using Pandas for financial processing would also be nice.

537

asked Mar 30 '12 13:03

kgr

2 Answers

With a more recent version of Pandas, there is a resample method. It is very fast and is useful to accomplish the same task:

ohlc_dict = {                                                                                                                  'Open': 'first',                                                                                                         'High': 'max',                                                                                                            'Low': 'min',                                                                                                             'Close': 'last',                                                                                                         'Volume': 'sum', }  df.resample('5T', closed='left', label='left').apply(ohlc_dict)

195

answered Sep 21 '22 02:09

Andrea

Your approach is sound, but fails because each function in the dict-of-functions applied to agg() receives a Series object reflecting the column matched by the key value. Therefore, it's not necessary to filter on column label again. With this, and assuming groupby preserves order, you can slice the Series to extract the first/last element of the Open/Close columns (note: groupby documentation does not claim to preserve order of original data series, but seems to in practice.)

In [50]: df.groupby(dr5minute.asof).agg({'Low': lambda s: s.min(),                                           'High': lambda s: s.max(),                                          'Open': lambda s: s[0],                                          'Close': lambda s: s[-1],                                          'Volume': lambda s: s.sum()}) Out[50]:                        Close    High     Low    Open  Volume key_0                                                       1999-01-04 10:20:00  1.1806  1.1819  1.1801  1.1801      34 1999-01-04 10:25:00  1.1789  1.1815  1.1776  1.1807      91 1999-01-04 10:30:00  1.1791  1.1792  1.1776  1.1780      16

For reference, here is a table to summarize the expected input and output types of an aggregation function based on the groupby object type and how the aggregation function(s) is/are passed to agg().

                  agg() method     agg func    agg func          agg()                   input type       accepts     returns           result GroupBy Object SeriesGroupBy     function         Series      value             Series                   dict-of-funcs    Series      value             DataFrame, columns match dict keys                   list-of-funcs    Series      value             DataFrame, columns match func names DataFrameGroupBy  function         DataFrame   Series/dict/ary   DataFrame, columns match original DataFrame                   dict-of-funcs    Series      value             DataFrame, columns match dict keys, where dict keys must be columns in original DataFrame                   list-of-funcs    Series      value             DataFrame, MultiIndex columns (original cols x func names)

From the above table, if aggregation requires access to more than one column, the only option is to pass a single function to a DataFrameGroupBy object. Therefore, an alternate way to accomplish the original task is to define a function like the following:

def ohlcsum(df):     df = df.sort()     return {        'Open': df['Open'][0],        'High': df['High'].max(),        'Low': df['Low'].min(),        'Close': df['Close'][-1],        'Volume': df['Volume'].sum()       }

and apply agg() with it:

In [30]: df.groupby(dr5minute.asof).agg(ohlcsum) Out[30]:                         Open    High     Low   Close  Volume key_0                                                       1999-01-04 10:20:00  1.1801  1.1819  1.1801  1.1806      34 1999-01-04 10:25:00  1.1807  1.1815  1.1776  1.1789      91 1999-01-04 10:30:00  1.1780  1.1792  1.1776  1.1791      16

Though pandas may offer some cleaner built-in magic in the future, hopefully this explains how to work with today's agg() capabilities.

answered Sep 19 '22 02:09

Garrett

Related questions
                            
                                Django Rest Framework - How to test ViewSet?
                            
                                Does enumerate() produce a generator object?
                            
                                Removing \u2018 and \u2019 character
                            
                                How can I add N milliseconds to a datetime in Python
                            
                                Why isn't PyCharm's autocomplete working for libraries I install?
                            
                                how to "source" file into python script
                            
                                pandas reset_index after groupby.value_counts()
                            
                                ImportError: No module named IPython
                            
                                How to extract the regression coefficient from statsmodels.api?
                            
                                Using .gitignore with *.pyc in subfolders
                            
                                Python Mechanize select a form with no name
                            
                                How to temporarily modify sys.path in Python?
                            
                                How to select element using XPATH syntax on Selenium for Python?
                            
                                Ubuntu: pip not working with python3.4
                            
                                Invalid command WSGIDaemonProcess Deploy Django application on CentOS 6.7
                            
                                pycharm ssh interpter No such file or directory
                            
                                How to save image created with 'pandas.DataFrame.plot'? [duplicate]
                            
                                How to insert a small image on the corner of a plot with matplotlib?
                            
                                Line plot with arrows in matplotlib
                            
                                Setting a relative frequency in a matplotlib histogram

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With