Efficient way to process pandas DataFrame timeseries with Numba

I have a DataFrame with 1,500,000 rows. It's one-minute-level stock market data that I bought from QuantQuote.com (Open, High, Low, Close, Volume). I'm trying to run some home-made backtests of stock market trading strategies. Straight Python code to process the transactions is too slow, and I wanted to try using Numba to speed things up. The trouble is that Numba doesn't seem to work with Pandas functions.

Google searches uncover surprisingly little information about using Numba with Pandas, which makes me wonder if I'm making a mistake by considering it.

My setup is Numba 0.13.0-1 and Pandas 0.13.1-1 on Windows 7, with MS VS2013 + PTVS, Python 2.7, and Enthought Canopy.

My existing Python+Pandas inner loop has the following general structure:

  • Compute "indicator" columns (with pd.ewma, pd.rolling_max, pd.rolling_min, etc.)
  • Compute "event" columns for predetermined events such as moving average crosses, new highs etc.

I then use DataFrame.iterrows to process the DataFrame.
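To make the structure concrete, here is a minimal sketch of that pipeline on toy data. The column names, window lengths, and the "cross above EMA" event rule are illustrative assumptions, not my actual strategy; pd.ewma and pd.rolling_max were the 0.13-era spellings, written here with the modern .ewm() / .rolling() equivalents:

```python
import numpy as np
import pandas as pd

# Toy one-minute close series standing in for the real OHLCV data
# (column names and sizes are assumptions for illustration).
idx = pd.date_range("2014-05-13 09:30", periods=10, freq="min")
df = pd.DataFrame({"close": np.linspace(100.0, 101.0, 10)}, index=idx)

# Step 1: indicator columns (pd.ewma / pd.rolling_max in pandas 0.13).
df["ema"] = df["close"].ewm(span=3, adjust=False).mean()
df["roll_max"] = df["close"].rolling(window=3, min_periods=1).max()

# Step 2: event columns, e.g. close crossing above its EMA.
df["cross_up"] = (df["close"] > df["ema"]) & (
    df["close"].shift(1) <= df["ema"].shift(1)
)

# Step 3: the slow row-by-row pass -- the part worth handing to Numba.
signals = []
for ts, row in df.iterrows():
    if row["cross_up"]:
        signals.append(ts)
```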

I've tried various optimizations, but it's still not as fast as I would like, and the optimizations keep introducing bugs.

I want to use numba to process the rows. Are there preferred methods of approaching this?

Because my DataFrame is really just a rectangle of floats, I was considering using DataFrame.values to get at the data and then writing a series of Numba-compiled functions that access the rows. But that strips off the timestamps, and I'm not sure the operation is reversible. I'm also not sure whether the array I get from DataFrame.values is guaranteed not to be a copy of the data.
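For what it's worth, the timestamp concern can be worked around by keeping the index on the side and reattaching it afterwards. A minimal sketch (column names are placeholders):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2014-05-13 09:30", periods=5, freq="min")
df = pd.DataFrame(
    {"open": np.arange(5.0), "close": np.arange(5.0) + 0.5}, index=idx
)

# Strip the timestamps off, keeping them separately...
raw = df.values      # plain 2-D float64 ndarray
stamps = df.index    # DatetimeIndex, held on the side

# ...process `raw` with compiled code, then rebuild the frame.
out = pd.DataFrame(raw, index=stamps, columns=df.columns)
```

The round trip is lossless as long as every column shares one dtype, since `values` then produces a single homogeneous array.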

Any help is greatly appreciated.

Asked May 13 '14 by JasonEdinburgh



1 Answer

Numba is a NumPy-aware just-in-time compiler. You can pass NumPy arrays as parameters to your Numba-compiled functions, but not Pandas series.

As of 2017-06-27, your only option is still to pass the Series' values, which are plain NumPy arrays.

Also, you ask whether the values are "guaranteed to not be a copy of the data". For a single-dtype DataFrame like yours, they are a view rather than a copy, which you can verify:

import pandas


df = pandas.DataFrame([0, 1, 2, 3])
df.values[2] = 8
print(df)  # Should show you the value `8`

In my opinion, Numba is a great (if not the best) approach to processing market data if you want to stick to Python. To see the biggest performance gains, make sure to use @numba.jit(nopython=True) (note that nopython mode will not let you use dictionaries and some other Python types inside the JIT-compiled function, but it makes the code run much faster).

Note that some of the indicators you are working with may already have an efficient implementation in Pandas, so consider pre-computing them with Pandas and then passing their values (the NumPy arrays) to your Numba backtesting function.

Answered Oct 23 '22 by Peque