 

Speed up custom aggregation functions

I have a very simple setup: market data (ticks) in a pandas dataframe df like so:

index period ask bid
00:00:00.126 42125 112.118 112.117
00:00:00.228 42125 112.120 112.117
00:00:00.329 42125 112.121 112.120
00:00:00.380 42125 112.123 112.120
00:00:00.432 42125 112.124 112.121
00:00:00.535 41126 112.124 112.121
00:00:00.586 41126 112.122 112.121
00:00:00.687 41126 112.124 112.121
00:00:01.198 41126 112.124 112.120
00:00:01.737 41126 112.124 112.121
00:00:02.243 41126 112.123 112.121

Now I use pandas.groupby to aggregate over periods:

g=df.groupby('period')

It is easy to get minimum and maximum prices by period, e.g.

import numpy as np
res=g.agg({'ask': [np.amax, np.amin]})

This is reasonably fast, too. Now I also want the first and last price per period. This is where the trouble begins. Of course, I can do:

res=g.agg({'ask': lambda x: x[0]})

and it basically works, but it is terribly slow for large data sets: the overhead of one Python function call per group is just huge.
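(For scale, one rough way to see that overhead is to time the callback version against a recognised NumPy reducer on the same groupby. This is only an illustrative sketch against the df above, not a benchmark; the exact numbers depend on the data.)

import timeit

# Per-group Python callback: one interpreter-level call per group.
t_callback = timeit.timeit(lambda: g.agg({'ask': lambda x: x.iloc[0]}), number=10)

# Recognised reducer, same kind of call as the min/max aggregation above.
t_reducer = timeit.timeit(lambda: g.agg({'ask': np.amax}), number=10)

print(t_callback, t_reducer)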

Does anyone know of a numpy function analogous to np.amax that will return the first or last element of a group? I could not find one. iloc[0] does not do the trick because it is a method bound to an object, so I can't pass it as a function to g.agg; I don't have the object at this stage (that's what the lambda is needed for).

Now, I am not lazy, and I tried to do this myself using Cython.

import numpy as np
cimport numpy as np

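# Typed helper intended to grab a group's first element without Python-level overhead.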
cpdef double first(np.ndarray array_series):
    return array_series[0]

But pandas won't accept this as an aggregation function, because it passes a pandas.core.series.Series object rather than an np.ndarray. (Never mind that one derives from the other; the compiler does not recognize that.)

Does anyone know how to write a Cython function that accepts a pandas Series without the Python call overhead?
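One partial workaround would be to hand the compiled function the underlying ndarray explicitly (assuming the first from the snippet above has been compiled and imported), but the lambda wrapper still costs a Python call per group:

# Unwrap the Series to its ndarray so the typed Cython signature matches;
# the lambda itself is still one Python-level call per group.
res = g.agg({'ask': lambda x: first(x.values)})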

Asked Oct 30 '15 by user5507059

2 Answers

IIUC then you can use 'first' and 'last':

In [270]:
g=df.groupby('period')
res=g.agg({'ask': [np.amax, np.amin, 'first', 'last']})
res

Out[270]:
            ask                           
           amax     amin    first     last
period                                    
41126   112.124  112.122  112.124  112.123
42125   112.124  112.118  112.118  112.124
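The string names dispatch to pandas' built-in first/last aggregations, which are also available directly as groupby methods. A minimal sketch, assuming the df and g from the question:

# GroupBy.first / GroupBy.last are built-in aggregations,
# so there is no per-group Python callback involved.
first_ask = g['ask'].first()
last_ask = g['ask'].last()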
Answered by EdChum

An alternative is to simply resample and use OHLC (open = first, close = last, high = max, low = min):

In [56]: df = DataFrame({'A' : np.arange(10), 'B' : pd.date_range('20130101',periods=5).tolist()*2})

In [57]: df
Out[57]: 
   A          B
0  0 2013-01-01
1  1 2013-01-02
2  2 2013-01-03
3  3 2013-01-04
4  4 2013-01-05
5  5 2013-01-01
6  6 2013-01-02
7  7 2013-01-03
8  8 2013-01-04
9  9 2013-01-05

In [58]: df.set_index('B').resample('D',how='ohlc')
Out[58]: 
              A               
           open high low close
B                             
2013-01-01    0    5   0     5
2013-01-02    1    6   1     6
2013-01-03    2    7   2     7
2013-01-04    3    8   3     8
2013-01-05    4    9   4     9
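Note that the how= keyword in the call above is from older pandas; in more recent versions the same thing would be written with the method form (a sketch, not part of the original answer):

# Newer pandas API: resample() returns a Resampler object,
# and .ohlc() replaces how='ohlc'.
df.set_index('B').resample('D').ohlc()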
Answered by Jeff