
Multivariate time series forecasting with 3 months dataset

I have 3 months of daily data (one row per day), and I want to perform a multivariate time series analysis on it.

The available columns are:

Date    Capacity_booked Total_Bookings  Total_Searches  %Variation

Each date has exactly one entry in the dataset, and I want to fit a multivariate time series model that forecasts the other variables as well.

This is my attempt so far, pieced together from articles I read:

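import pandas as pd

# df is the DataFrame loaded from the dataset; parse the Date column as day/month/year
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')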

data = df.drop(['Date'], axis=1)

data.index = df.Date

from statsmodels.tsa.vector_ar.vecm import coint_johansen

# Johansen cointegration test (det_order=-1, k_ar_diff=1); inspect the eigenvalues
johan_test_temp = data
coint_johansen(johan_test_temp, -1, 1).eig

#creating the train and validation set
train = data[:int(0.8*(len(data)))]
valid = data[int(0.8*(len(data))):]

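from statsmodels.tsa.vector_ar.var_model import VAR

# infer the sampling frequency from the DatetimeIndex and pass it to the model
freq = train.index.inferred_freq
model = VAR(endog=train, freq=freq)
# with no arguments, fit() estimates a VAR with a single lag
model_fit = model.fit()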

# forecast the validation period, seeding with the last k_ar training observations
prediction = model_fit.forecast(train.values[-model_fit.k_ar:], steps=len(valid))

cols = data.columns

# wrap the forecast array in a DataFrame aligned with the validation dates
pred = pd.DataFrame(prediction, index=valid.index, columns=cols)

I now have a validation set and a set of predictions, but the predictions are far worse than I expected.

The plots of the dataset are:

  1. %Variation [image]

  2. Capacity_booked [image]

  3. Total bookings and searches [image]

The output I am receiving is:

Prediction dataframe: [image]

Validation dataframe: [image]

As you can see, the predictions are far from the actual values. Can anyone suggest a way to improve the accuracy? Also, if I fit the model on the whole dataset and then print the forecasts, it does not take into account that a new month has started and adjust the predictions accordingly. How can that be incorporated here? Any help is appreciated.

EDIT

Link to the dataset - Dataset

Thanks

asked Nov 17 '19 13:11 by dper


1 Answer

One way to improve your accuracy is to look at the autocorrelation of each variable, as suggested on the VAR documentation page:

https://www.statsmodels.org/dev/vector_ar.html

The larger the autocorrelation (in absolute value) at a specific lag, the more useful that lag is likely to be for the model.

Another good idea is to use the AIC and BIC criteria to choose the model order (the page linked above has a usage example). Smaller values indicate a better trade-off between goodness of fit and model complexity.

This way, you can vary the order of your autoregressive model and pick the one with the lowest AIC and BIC, considering both together. If AIC indicates the best model has a lag of 3 and BIC indicates a lag of 5, you should try lags 3, 4 and 5 and see which one gives the best results.

The best scenario would be to have more data (3 months is not much), but you can try these approaches and see if they help.

answered Oct 02 '22 15:10 by Danilo Nunes