Performance issues with pandas and filtering on datetime column

Question

I have a pandas DataFrame in which one of the columns holds datetime64 values.

    time    volume  complete    closeBid    closeAsk    openBid openAsk highBid highAsk lowBid  lowAsk  closeMid
0   2016-08-07 21:00:00+00:00   9   True    0.84734 0.84842 0.84706 0.84814 0.84734 0.84842 0.84706 0.84814 0.84788
1   2016-08-07 21:05:00+00:00   10  True    0.84735 0.84841 0.84752 0.84832 0.84752 0.84846 0.84712 0.8482  0.84788
2   2016-08-07 21:10:00+00:00   10  True    0.84742 0.84817 0.84739 0.84828 0.84757 0.84831 0.84735 0.84817 0.847795
3   2016-08-07 21:15:00+00:00   18  True    0.84732 0.84811 0.84737 0.84813 0.84737 0.84813 0.84721 0.8479  0.847715
4   2016-08-07 21:20:00+00:00   4   True    0.84755 0.84822 0.84739 0.84812 0.84755 0.84822 0.84739 0.84812 0.847885
5   2016-08-07 21:25:00+00:00   4   True    0.84769 0.84843 0.84758 0.84827 0.84769 0.84843 0.84758 0.84827 0.84806
6   2016-08-07 21:30:00+00:00   5   True    0.84764 0.84851 0.84768 0.84852 0.8478  0.84857 0.84764 0.84851 0.848075
7   2016-08-07 21:35:00+00:00   4   True    0.84755 0.84825 0.84762 0.84844 0.84765 0.84844 0.84755 0.84824 0.8479
8   2016-08-07 21:40:00+00:00   1   True    0.84759 0.84812 0.84759 0.84812 0.84759 0.84812 0.84759 0.84812 0.847855
9   2016-08-07 21:45:00+00:00   3   True    0.84727 0.84817 0.84743 0.8482  0.84743 0.84822 0.84727 0.84817 0.84772

My application follows the (simplified) structure below:

from datetime import timedelta

class Runner():
    def execute_tick(self, clock_tick, previous_tick):
        candles = self.broker.get_new_candles(clock_tick, previous_tick)
        # A DataFrame has no unambiguous truth value, so test for emptiness explicitly
        if not candles.empty:
            run_calculations(candles)

class Broker():
    def get_new_candles(self, clock_tick, previous_tick):
        start = previous_tick - timedelta(minutes=1)
        end = clock_tick - timedelta(minutes=3)
        # df is the candle dataframe shown above
        return df[(df.time > start) & (df.time <= end)]

When profiling the app, I noticed that the call df[(df.time > start) & (df.time <= end)] is the most expensive operation. Is there a way to speed up these calls?

EDIT: I'm adding some more info about the use-case here (also, source is available at: https://github.com/jmelett/pyFxTrader)

The application will accept a list of instruments (e.g. EUR_USD, USD_JPY, GBP_CHF) and then pre-fetch ticks/candles for each one of them and their timeframes (e.g. 5 minutes, 30 minutes, 1 hour etc.). The initialised data is basically a dict of Instruments, each containing another dict with candle data for M5, M30, H1 timeframes.
Each "timeframe" is a pandas dataframe like the one shown at the top.
A clock simulator is then used to query the individual candles for a specific time (e.g. for EUR_USD at 15:30:00, give me the last x "5-minute candles").
This piece of data is then used to "simulate" specific market conditions (e.g. average price over the last hour increased by 10%, buy market position).
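To make the layout above concrete, here is a minimal sketch of the nested structure and a "last x candles" lookup. The names `make_candles`, `candles_by_instrument`, and `last_candles` are hypothetical and the candle values are made up; the actual project may organise this differently.

```python
import pandas as pd

def make_candles(start, periods, freq):
    """Build a tiny placeholder candle frame (values are made up)."""
    times = pd.date_range(start, periods=periods, freq=freq, tz="UTC")
    mids = [1.0 + 0.001 * i for i in range(periods)]
    return pd.DataFrame({"time": times, "closeMid": mids})

# Hypothetical layout: instrument -> timeframe -> candle DataFrame
candles_by_instrument = {
    "EUR_USD": {
        "M5": make_candles("2016-08-07 21:00", 12, "5min"),
        "M30": make_candles("2016-08-07 21:00", 4, "30min"),
    },
}

def last_candles(instrument, timeframe, now, count):
    """Return the last `count` candles stamped at or before `now`."""
    frame = candles_by_instrument[instrument][timeframe]
    return frame[frame["time"] <= now].tail(count)

now = pd.Timestamp("2016-08-07 21:30", tz="UTC")
recent = last_candles("EUR_USD", "M5", now, 3)
print(recent)
```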

piRSquared · Accepted Answer

If efficiency is your goal, I'd use numpy for just about everything.

I rewrote get_new_candles as get_new_candles2:

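def get_new_candles2(clock_tick, previous_tick):
    start = previous_tick - timedelta(minutes=1)
    end = clock_tick - timedelta(minutes=3)
    # Compare on the underlying NumPy datetime64 array rather than the pandas Series.
    # Note: the lower bound here is inclusive (>=), while get_new_candles uses a strict >.
    ge_start = df.time.values >= start.to_datetime64()
    le_end = df.time.values <= end.to_datetime64()
    mask = ge_start & le_end
    # Rebuild the frame from the masked values, the masked index, and the original columns
    return pd.DataFrame(df.values[mask], df.index[mask], df.columns)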
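As a possible variation (not part of the function above), if the time column is sorted in ascending order, the two comparisons could be replaced by a binary search with `np.searchsorted`. This is only a sketch under that sortedness assumption; the helper name `slice_sorted` and the small sample frame are made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: the time column is sorted ascending, as candle data usually is.
times = pd.date_range("2016-08-07 21:00", periods=10, freq="5min")
frame = pd.DataFrame({"time": times, "volume": range(10)})

def slice_sorted(frame, start, end):
    """Return rows with start < time <= end using binary search."""
    values = frame["time"].values  # NumPy datetime64 array
    # side="right" gives the first position whose time is strictly greater
    lo = np.searchsorted(values, start.to_datetime64(), side="right")
    hi = np.searchsorted(values, end.to_datetime64(), side="right")
    return frame.iloc[lo:hi]

start = pd.Timestamp("2016-08-07 21:09")
end = pd.Timestamp("2016-08-07 21:42")
result = slice_sorted(frame, start, end)

# Same rows as the boolean-mask filter from the question
expected = frame[(frame["time"] > start) & (frame["time"] <= end)]
print(result.equals(expected))
```

Whether this helps in practice depends on the size of the frame, so it is worth measuring before adopting it.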

Setup of data

from datetime import timedelta
from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"
import pandas as pd

text = """time,volume,complete,closeBid,closeAsk,openBid,openAsk,highBid,highAsk,lowBid,lowAsk,closeMid
2016-08-07 21:00:00+00:00,9,True,0.84734,0.84842,0.84706,0.84814,0.84734,0.84842,0.84706,0.84814,0.84788
2016-08-07 21:05:00+00:00,10,True,0.84735,0.84841,0.84752,0.84832,0.84752,0.84846,0.84712,0.8482,0.84788
2016-08-07 21:10:00+00:00,10,True,0.84742,0.84817,0.84739,0.84828,0.84757,0.84831,0.84735,0.84817,0.847795
2016-08-07 21:15:00+00:00,18,True,0.84732,0.84811,0.84737,0.84813,0.84737,0.84813,0.84721,0.8479,0.847715
2016-08-07 21:20:00+00:00,4,True,0.84755,0.84822,0.84739,0.84812,0.84755,0.84822,0.84739,0.84812,0.847885
2016-08-07 21:25:00+00:00,4,True,0.84769,0.84843,0.84758,0.84827,0.84769,0.84843,0.84758,0.84827,0.84806
2016-08-07 21:30:00+00:00,5,True,0.84764,0.84851,0.84768,0.84852,0.8478,0.84857,0.84764,0.84851,0.848075
2016-08-07 21:35:00+00:00,4,True,0.84755,0.84825,0.84762,0.84844,0.84765,0.84844,0.84755,0.84824,0.8479
2016-08-07 21:40:00+00:00,1,True,0.84759,0.84812,0.84759,0.84812,0.84759,0.84812,0.84759,0.84812,0.847855
2016-08-07 21:45:00+00:00,3,True,0.84727,0.84817,0.84743,0.8482,0.84743,0.84822,0.84727,0.84817,0.84772
"""

df = pd.read_csv(StringIO(text), parse_dates=[0])

Test input variables

previous_tick = pd.to_datetime('2016-08-07 21:10:00')
clock_tick = pd.to_datetime('2016-08-07 21:45:00')

get_new_candles2(clock_tick, previous_tick)


Timing

[Image: timing results]
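To reproduce a comparison of this kind yourself, something along these lines should work. It is only a sketch: the synthetic data, the row count, and the names `filter_pandas` and `filter_numpy` are assumptions for illustration, not the benchmark used for the answer, and the result will vary with frame size and column dtypes.

```python
import timeit

import pandas as pd

# Synthetic candles (an assumption; substitute your own data)
n = 50_000
df = pd.DataFrame({
    "time": pd.date_range("2016-01-01", periods=n, freq="5min"),
    "volume": range(n),
})
start = pd.Timestamp("2016-02-01")
end = pd.Timestamp("2016-02-02")

def filter_pandas():
    # Boolean indexing on the pandas Series, as in the question
    return df[(df["time"] > start) & (df["time"] <= end)]

def filter_numpy():
    # Mask built on the raw NumPy array, frame rebuilt from the masked values
    times = df["time"].values
    mask = (times > start.to_datetime64()) & (times <= end.to_datetime64())
    return pd.DataFrame(df.values[mask], df.index[mask], df.columns)

for func in (filter_pandas, filter_numpy):
    seconds = timeit.timeit(func, number=10)
    print(f"{func.__name__}: {seconds / 10 * 1e3:.3f} ms per call")
```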

Tags: python, pandas, dataframe, numpy

Joseph jun. Melettukunnel
