Multiple period persistence, vectorization, time series python

Question

I have a DataFrame with daily values and I am building out a forecast using various methods predicting the values for the next two weeks.

As a base, naive, forecast I want to simply say the value today is the best forcast for the next two weeks e.g.:

the value on 01-Jan-2012 is 100, then I would like the forecast for 02-Jan-2012 to 15-Jan-2022 to be 100
the value on 02-Jan-2012 is 110, then I would like the forecast for 03-Jan-2012 to 16-Jan-2022 to be 110
etc

This method can then be compared to the other forecasts to see whether they add value over a naive approach.

To backtest this model how can I do this? I have a few years worth of data in a DataFrame, and I want to do something like below. Reading online, I can only find 1 day persistence help whereby simply using something like df.shift(1) does the job.

Pseudocode:
get the first row from the DataFrame
extract the date from the index
extract the value from the column
propogate forward this value for the next fourteen days
save these forecast dates and forecast values

get the second row from the DataFrame
extract the date from the index
extract the value from the column
propogate forward this value for the next fourteen days
save these forecast dates and forecast values

REPEAT...

However, I've read that iterating over rows is advised against and it is better to use something like pandas apply to 'vectorize' the data but I am not sure how to do this. I was thinking of writing a function to predict the next 14 days then using the apply method to call this function, but not sure how to do so or if this is the best way.

I've also read that numpy is very good for these sorts of problems, but again, am not too familiar.

I've set up a sqlite database so I can store forecasts in there if that helps.

chrslg · Accepted Answer

General principles

Here my shot at it. I use a function that does almost that: np.lib.stride_tricks.sliding_window. Except that it shows forward values, when what you want to do is backward values (at t, you want to see value at t-1 as "prediction J+1", value at t-2 as "prediction J+2", etc. And more generally speaking, you want last 14 values to build a prediction). But that is easy to solve : just add some dummy NaN at the beginning. So that "prediction J+1" is NaN in the first line (you can't predict value of the first line with value of yesterday, since you haven't any value of yesterday)

Results are very similar to the one of the two other answers you got so far. Same 2D dataframe. With different formatting (one is using dates and predictions as 2 indexes), and filling (one is filling unknown values with today value)

But timings are not.

Code

import pandas as pd
import numpy as np
import datetime

# Data generation
N=1000
dt=datetime.timedelta(days=1)
dates=[datetime.date(2012,1,1)+k*dt for k in range(N)]
temps=np.random.normal(100,15,(N,))
df = pd.DataFrame({'time':dates, 'temp':temps})

# slideWindowMethod
def mslide():
    T=np.concatenate(([np.nan]*14, df['temp'].values))
    TwithShift=np.lib.stride_tricks.sliding_window_view(T, 15)
    return pd.concat([df, pd.DataFrame(TwithShift[:,:-1], columns=[f'Pred_J+{i}' for i in range(14,0,-1)])], axis=1)

dfWithPred=mslide()

(Using arg less functions is just because it made it easier to use timeit. But of course, df, and 14 should be arguments)

Explanation

The main idea resides in this sliding_window_view function.

M=np.arange(10)
np.lib.stride_tricks.sliding_window_view(M, 4)

⇒

array([[0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5],
       [3, 4, 5, 6],
       [4, 5, 6, 7],
       [5, 6, 7, 8],
       [6, 7, 8, 9]])

Note that it stops at line starting with 6, since we don't have future values for the others. Not a problem, see later. Other than that, 1st column is the original array.

And one of the beauty in it (tho we don't take advantage of it here, since we will build an array at the end) is that it is just a view. No new array is built here. So, even if you have 100 millions numbers in an array M, you can still get sliding_window_view(M,1000) array that contains all "shifts" of M for 0 to 1000, without allocating memory for 100 billions numbers. It can be very convenient to perform vectorize operations that use historical data.

Since here it shows future values, not past values (and this is what you want: past values, since saying "prediction of line J+1 is today value" is the same as saying that "prediction of today is value at J-1", with just different meaning of J), you can easily fill the array with NaN before shifting.

T=np.arange(10)
TwithNa=np.concatenate(([np.nan]*3, T))
np.lib.stride_tricks.sliding_window_view(TwithNa, 4)

⇒

array([[nan, nan, nan,  0.],
       [nan, nan,  0.,  1.],
       [nan,  0.,  1.,  2.],
       [ 0.,  1.,  2.,  3.],
       [ 1.,  2.,  3.,  4.],
       [ 2.,  3.,  4.,  5.],
       [ 3.,  4.,  5.,  6.],
       [ 4.,  5.,  6.,  7.],
       [ 5.,  6.,  7.,  8.],
       [ 6.,  7.,  8.,  9.]])

Just same array, but with 3 NaN before in the first column. And we have therefore 10 rows. With that shifting, it is last column that is identical to the original array. And columns before are the previous, previous-previous, etc, values.

Rest is just matter of fillig a dataframe with it.

Timings

But the real beauty of it, of course, is timing. Here my timeit results with my method, and the two other methods you got so far

Method	Time μs
Naive for	1022001.3829995878
Apply/join	158593.13219989417
ffill	67126.1526333789
Sliding window	343.24235070089344

You see, it is not a mere optimization. It is a game changer!

Note that, generally speaking apply is not much better than a naive for loop accessing dataframe with .iloc. Sometimes it is even slower. It is most of the time very disappointing. In this case, it is not that bad. ×6 time gain, it is often far less than that. Yet, nothing compared to ×3000 you get without. So when you say

However, I've read that iterating over rows is advised against and it is better to use something like pandas apply to 'vectorize' the data but I am not sure how to do this.

I obviously agree with the first part. But I would put apply on the same shelf as iterating over rows. Most of the time it is the same. And it always better to try to avoid it.

Multiple period persistence, vectorization, time series python

Tags:

python

pandas

numpy

apply

time-series

13sen1

1 Answers

General principles

Code

Explanation

Timings

chrslg

Recent Activity

Donate For Us

Multiple period persistence, vectorization, time series python

Tags:

python

pandas

numpy

apply

time-series

13sen1

1 Answers

General principles

Code

Explanation

Timings

chrslg

Related questions

Recent Activity

Donate For Us