Assume we have a time-series data that contains the daily orders count of last two years:
We can predict the future's orders using Python's statsmodels library:
fit = statsmodels.api.tsa.statespace.SARIMAX(
train.Count, order=(2, 1, 4),seasonal_order=(0,1,1,7)
).fit()
y_hat_avg['SARIMA'] = fit1.predict(
start="2018-06-16", end="2018-08-14", dynamic=True
)
Result (don't mind the numbers):
Now assume that our input data has some unusual increase or decrease, because of holidays or promotions in the company. So we added two columns that tell if each day was a "holiday" and a day that the company has had "promotion".
Is there a method (and a way of implementing it in Python) to use this new type of input data and help the model to understand the reason of outliers, and also predict the future's orders with providing "holiday" and "promotion_day" information?
fit1.predict('2018-08-29', holiday=True, is_promotion=False)
# or
fit1.predict(start="2018-08-20", end="2018-08-25", holiday=[0,0,0,1,1,0], is_promotion=[0,0,1,1,0,1])
There is no concept of input and output features in time series. Instead, we must choose the variable to be predicted and use feature engineering to construct all of the inputs that will be used to make predictions for future time steps.
In the time series prediction, it is common to use the historical value of the target variable to predict its future value. If the target variable depends on multiple attributes and each attribute forms a time series prediction, how could we make use of these attributes to predict the future value?
A univariate time series dataset is only comprised of a sequence of observations. These must be transformed into input and output features in order to use supervised learning algorithms.
We can use feature importance to help to estimate the relative importance of contrived input features for time series forecasting. This is important because we can contrive not only the lag observation features above, but also features based on the timestamp of observations, rolling statistics, and much more.
SARIMAX
, as a generalisation of the SARIMA
model, is designed to handle exactly this. From the docs,
Parameters:
- endog (array_like) – The observed time-series process y;
- exog (array_like, optional) – Array of exogenous regressors, shaped
(nobs, k)
.
You could pass the holiday
and promotion_day
as an array of size (nobs, 2)
to exog
, which will inform the model of the exogenous nature of some of these observations.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With