I have a DataFrame with daily values and I am building out a forecast using various methods predicting the values for the next two weeks.
As a base, naive, forecast I want to simply say the value today is the best forcast for the next two weeks e.g.:
01-Jan-2012
is 100
, then I would like the forecast for 02-Jan-2012
to 15-Jan-2022
to be 100
02-Jan-2012
is 110
, then I would like the forecast for 03-Jan-2012
to 16-Jan-2022
to be 110
This method can then be compared to the other forecasts to see whether they add value over a naive approach.
To backtest this model how can I do this? I have a few years worth of data in a DataFrame, and I want to do something like below. Reading online, I can only find 1 day persistence help whereby simply using something like df.shift(1)
does the job.
Pseudocode:
get the first row from the DataFrame
extract the date from the index
extract the value from the column
propogate forward this value for the next fourteen days
save these forecast dates and forecast values
get the second row from the DataFrame
extract the date from the index
extract the value from the column
propogate forward this value for the next fourteen days
save these forecast dates and forecast values
REPEAT...
However, I've read that iterating over rows is advised against and it is better to use something like pandas apply
to 'vectorize' the data but I am not sure how to do this. I was thinking of writing a function to predict the next 14 days then using the apply
method to call this function, but not sure how to do so or if this is the best way.
I've also read that numpy is very good for these sorts of problems, but again, am not too familiar.
I've set up a sqlite database so I can store forecasts in there if that helps.
Here my shot at it.
I use a function that does almost that: np.lib.stride_tricks.sliding_window
.
Except that it shows forward values, when what you want to do is backward values (at t, you want to see value at t-1 as "prediction J+1", value at t-2 as "prediction J+2", etc. And more generally speaking, you want last 14 values to build a prediction).
But that is easy to solve : just add some dummy NaN at the beginning. So that "prediction J+1" is NaN in the first line (you can't predict value of the first line with value of yesterday, since you haven't any value of yesterday)
Results are very similar to the one of the two other answers you got so far. Same 2D dataframe. With different formatting (one is using dates and predictions as 2 indexes), and filling (one is filling unknown values with today value)
But timings are not.
import pandas as pd
import numpy as np
import datetime
# Data generation
N=1000
dt=datetime.timedelta(days=1)
dates=[datetime.date(2012,1,1)+k*dt for k in range(N)]
temps=np.random.normal(100,15,(N,))
df = pd.DataFrame({'time':dates, 'temp':temps})
# slideWindowMethod
def mslide():
T=np.concatenate(([np.nan]*14, df['temp'].values))
TwithShift=np.lib.stride_tricks.sliding_window_view(T, 15)
return pd.concat([df, pd.DataFrame(TwithShift[:,:-1], columns=[f'Pred_J+{i}' for i in range(14,0,-1)])], axis=1)
dfWithPred=mslide()
(Using arg less functions is just because it made it easier to use timeit. But of course, df, and 14 should be arguments)
The main idea resides in this sliding_window_view
function.
M=np.arange(10)
np.lib.stride_tricks.sliding_window_view(M, 4)
⇒
array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6],
[4, 5, 6, 7],
[5, 6, 7, 8],
[6, 7, 8, 9]])
Note that it stops at line starting with 6, since we don't have future values for the others. Not a problem, see later. Other than that, 1st column is the original array.
And one of the beauty in it (tho we don't take advantage of it here, since we will build an array at the end) is that it is just a view. No new array is built here. So, even if you have 100 millions numbers in an array M, you can still get sliding_window_view(M,1000)
array that contains all "shifts" of M for 0 to 1000, without allocating memory for 100 billions numbers.
It can be very convenient to perform vectorize operations that use historical data.
Since here it shows future values, not past values (and this is what you want: past values, since saying "prediction of line J+1 is today value" is the same as saying that "prediction of today is value at J-1", with just different meaning of J), you can easily fill the array with NaN before shifting.
T=np.arange(10)
TwithNa=np.concatenate(([np.nan]*3, T))
np.lib.stride_tricks.sliding_window_view(TwithNa, 4)
⇒
array([[nan, nan, nan, 0.],
[nan, nan, 0., 1.],
[nan, 0., 1., 2.],
[ 0., 1., 2., 3.],
[ 1., 2., 3., 4.],
[ 2., 3., 4., 5.],
[ 3., 4., 5., 6.],
[ 4., 5., 6., 7.],
[ 5., 6., 7., 8.],
[ 6., 7., 8., 9.]])
Just same array, but with 3 NaN before in the first column. And we have therefore 10 rows. With that shifting, it is last column that is identical to the original array. And columns before are the previous, previous-previous, etc, values.
Rest is just matter of fillig a dataframe with it.
But the real beauty of it, of course, is timing. Here my timeit results with my method, and the two other methods you got so far
Method | Time μs |
---|---|
Naive for | 1022001.3829995878 |
Apply/join | 158593.13219989417 |
ffill | 67126.1526333789 |
Sliding window | 343.24235070089344 |
You see, it is not a mere optimization. It is a game changer!
Note that, generally speaking apply is not much better than a naive for loop accessing dataframe with .iloc
. Sometimes it is even slower. It is most of the time very disappointing. In this case, it is not that bad. ×6 time gain, it is often far less than that. Yet, nothing compared to ×3000 you get without. So when you say
However, I've read that iterating over rows is advised against and it is better to use something like pandas apply to 'vectorize' the data but I am not sure how to do this.
I obviously agree with the first part. But I would put apply
on the same shelf as iterating over rows. Most of the time it is the same. And it always better to try to avoid it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With