Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to vectorise Pandas calculation that is based on last x rows of data

Tags:

python

pandas

I have a fairly sophisticate prediction code with over 20 columns and millions of data per column using wls. Now i use iterrow to loop through dates, then based on those dates and values in those dates, extract different sizes of data for calculation. it takes hours to run in my production, I simplify the code into the following:

import pandas as pd
import numpy as np
from datetime import timedelta

df=pd.DataFrame(np.random.randn(1000,2), columns=list('AB'))
df['dte'] = pd.date_range('9/1/2014', periods=1000, freq='D')

def calculateC(A, dte):
    if A>0: #based on values has different cutoff length for trend prediction
        depth=10
    else:
        depth=20
    lastyear=(dte-timedelta(days=365)) 
    df2=df[df.dte<lastyear].head(depth) #use last year same date data for basis of prediction
    return df2.B.mean() #uses WLS in my model but for simplification replace with mean

for index, row in df.iterrows():
    if index>365:
        df.loc[index,'C']=calculateC(row.A, row.dte)

I read that iterrow is the main cause because it is not an effective way to use Pandas, and I should use vector methods. However, I can't seem to be able to find a way to vector based on conditions (dates, different length, and range of values). Is there a way?

like image 663
desmond Avatar asked Jun 26 '16 04:06

desmond


People also ask

What is a correct pandas method for returning the last rows?

The tail() method returns the last n rows. By default, the last 5 rows are returned. You can specify the number of rows.

Which pandas method is used to extract the last rows of the table?

Method 1: Using tail() method DataFrame. tail(n) to get the last n rows of the DataFrame. It takes one optional argument n (number of rows you want to get from the end). By default n = 5, it return the last 5 rows if the value of n is not passed to the method.

Which method of pandas can be used for returning last 5 rows?

Pandas tail() method is used to return bottom n (5 by default) rows of a data frame or series.


1 Answers

I have good news and bad news. The good news is I have something vectorized that is about 300x faster but the bad news is that I can't quite replicate your results. But I think that you ought to be able to use the principles here to greatly speed up your code, even if this code does not actually replicate your results at the moment.

df['result'] = np.where( df['A'] > 0,
                         df.shift(365).rolling(10).B.mean(),
                         df.shift(365).rolling(20).B.mean() )

The tough (slow) part of your code is this:

df2=df[df.dte<lastyear].head(depth)

However, as long as your dates are all 365 days apart, you can use code like this, which is vectorized and much faster:

df.shift(365).rolling(10).B.mean()

shift(365) replaces df.dte < lastyear and the rolling().mean() replaces head().mean(). It will be much faster and use less memory.

And actually, even if your dates aren't completely regular, you can probably resample and get this way to work. Or, somewhat equivalently, if you make the date your index, the shift can be made to work based on a frequency rather than rows (e.g. shift 365 days, even if that is not 365 rows). It would probably be a good idea to make 'dte' your index here regardless.

like image 111
JohnE Avatar answered Oct 22 '22 01:10

JohnE