I have a fairly sophisticated prediction code with over 20 columns and millions of rows per column, using WLS. Currently I use iterrows() to loop through dates, and then, based on those dates and the values on those dates, extract slices of different sizes for the calculation. It takes hours to run in production. I've simplified the code to the following:
import pandas as pd
import numpy as np
from datetime import timedelta

df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
df['dte'] = pd.date_range('9/1/2014', periods=1000, freq='D')

def calculateC(A, dte):
    if A > 0:  # different cutoff lengths for trend prediction, based on the value
        depth = 10
    else:
        depth = 20
    lastyear = dte - timedelta(days=365)
    df2 = df[df.dte < lastyear].head(depth)  # use last year's data as the basis of prediction
    return df2.B.mean()  # the real model uses WLS; replaced with mean() for simplicity

for index, row in df.iterrows():
    if index > 365:
        df.loc[index, 'C'] = calculateC(row.A, row.dte)
I read that iterrows() is the main culprit because it is not an efficient way to use pandas, and that I should use vectorized methods instead. However, I can't seem to find a way to vectorize based on my conditions (dates, different lengths, and ranges of values). Is there a way?
I have good news and bad news. The good news is I have something vectorized that is about 300x faster, but the bad news is that I can't quite replicate your results. Still, I think you ought to be able to use the principles here to greatly speed up your code, even if this code does not actually replicate your results at the moment.
# Pick the 10-row or 20-row trailing mean of B from 365 rows back,
# depending on the sign of A, in a single vectorized pass.
df['result'] = np.where(df['A'] > 0,
                        df.shift(365).rolling(10).B.mean(),
                        df.shift(365).rolling(20).B.mean())
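If you want to measure the difference on your own data, here is a minimal sketch using the standard timeit module, assuming the df and calculateC from the question are already defined (the wrapper function names are just for illustration):

from timeit import timeit

def loop_version():
    # The original iterrows() approach from the question.
    for index, row in df.iterrows():
        if index > 365:
            df.loc[index, 'C'] = calculateC(row.A, row.dte)

def vectorized_version():
    # The np.where / shift / rolling approach above.
    df['result'] = np.where(df['A'] > 0,
                            df.shift(365).rolling(10).B.mean(),
                            df.shift(365).rolling(20).B.mean())

print('loop:      ', timeit(loop_version, number=3))
print('vectorized:', timeit(vectorized_version, number=3))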
The tough (slow) part of your code is this:
df2 = df[df.dte < lastyear].head(depth)
However, as long as your dates are evenly spaced one day apart (so that 365 rows is exactly one year), you can use code like this, which is vectorized and much faster:
df.shift(365).rolling(10).B.mean()
The shift(365) replaces df.dte < lastyear, and the rolling().mean() replaces head().mean(). It will be much faster and use less memory.
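As a quick sanity check of what the shift-plus-rolling composition computes, here is a minimal sketch (the toy series, the position i, and the window sizes are invented for illustration):

import pandas as pd

s = pd.Series(range(10), dtype=float)

# Row-based lag of 3, then a 2-row trailing mean.
vectorized = s.shift(3).rolling(2).mean()

# The loop equivalent at a single position i: the mean of the 2 rows
# ending 3 rows back, i.e. rows i-4 and i-3.
i = 6
loop_value = s.iloc[i - 4:i - 2].mean()
assert vectorized.iloc[i] == loop_value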
And actually, even if your dates aren't completely regular, you can probably resample and make this approach work. Or, somewhat equivalently, if you make the date your index, the shift can be made to work based on a frequency rather than rows (e.g. shift 365 days, even if that is not 365 rows). It would probably be a good idea to make 'dte' your index here regardless.
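For example, here is a minimal sketch of the date-indexed, frequency-based variant, assuming the same toy frame as in the question (the name lagged and the 365-day offset are illustrative choices):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
df['dte'] = pd.date_range('9/1/2014', periods=1000, freq='D')

dfi = df.set_index('dte')  # a date index lets shift operate on time, not rows

# Shift the index labels forward 365 calendar days, then realign: at each
# date, 'lagged' holds the B value observed 365 days earlier (NaN if missing).
lagged = dfi['B'].shift(365, freq='D').reindex(dfi.index)

dfi['C'] = np.where(dfi['A'] > 0,
                    lagged.rolling(10).mean(),
                    lagged.rolling(20).mean())

With a datetime index you could also make the window itself time-based, e.g. lagged.rolling('10D').mean(), so gaps in the data don't silently change the window length.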