python statsmodels: Help using ARIMA model for time series

Tags: python, statistics, statsmodels

ARIMA from statsmodels is giving me inaccurate predictions. Could someone help me understand what's wrong with my code?

This is a sample:

import pandas as pd
import numpy as np
import datetime as dt
from statsmodels.tsa.arima_model import ARIMA

# Set up a data frame with twenty daily observations of roughly
# linear data, from approximately 1 through 20
counts = np.arange(1, 21) + 0.2 * (np.random.random(size=(20,)) - 0.5)
start = dt.datetime.strptime("1 Nov 01", "%d %b %y")
daterange = pd.date_range(start, periods=20)
table = {"count": counts, "date": daterange}
data = pd.DataFrame(table)
data.set_index("date", inplace=True)

print(data)

               count
date
2001-11-01   0.998543
2001-11-02   1.914526
2001-11-03   3.057407
2001-11-04   4.044301
2001-11-05   4.952441
2001-11-06   6.002932
2001-11-07   6.930134
2001-11-08   8.011137
2001-11-09   9.040393
2001-11-10  10.097007
2001-11-11  11.063742
2001-11-12  12.051951
2001-11-13  13.062637
2001-11-14  14.086016
2001-11-15  15.096826
2001-11-16  15.944886
2001-11-17  17.027107
2001-11-18  17.930240
2001-11-19  18.984202
2001-11-20  19.971603

The rest of the code sets up the ARIMA model.

# Setting up ARIMA model
order = (2, 1, 2)
model = ARIMA(data, order, freq='D')
model = model.fit()
print(model.predict(1, 20))

2001-11-02    1.006694
2001-11-03    1.056678
2001-11-04    1.116292
2001-11-05    1.049992
2001-11-06    0.869610
2001-11-07    1.016006
2001-11-08    1.110689
2001-11-09    0.945190
2001-11-10    0.882679
2001-11-11    1.139272
2001-11-12    1.094019
2001-11-13    0.918182
2001-11-14    1.027932
2001-11-15    1.041074
2001-11-16    0.898727
2001-11-17    1.078199
2001-11-18    1.027331
2001-11-19    0.978840
2001-11-20    0.943520
2001-11-21    1.040227
Freq: D, dtype: float64

As you can see, the predictions just hover around 1 instead of increasing. What am I doing wrong here?

(On a side note, I can't pass string dates like "2001-11-21" to the predict function for some reason. It would be helpful to know why.)

asked Jul 29 '15 02:07

hlin117

1 Answer

TL;DR

Called this way, predict returns a linear prediction of the differenced endogenous variable, not a prediction of the levels of the original endogenous variable.

Pass typ='levels' to the predict method to change this behavior:

preds = fit.predict(1, 30, typ='levels')

See documentation of ARIMAResults.predict for details.
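
To see why the default output hovers around 1, here is a minimal sketch using numbers copied from the question. It assumes that an in-sample, one-step-ahead level prediction is the previous observed level plus the predicted difference; the variable names are illustrative and not part of statsmodels.

```python
import numpy as np

# First four observed levels, copied from the question's data.
observed = np.array([0.998543, 1.914526, 3.057407, 4.044301])

# First three values printed by model.predict(1, 20) in the question.
# These are predictions of the day-to-day change, not of the level.
predicted_diffs = np.array([1.006694, 1.056678, 1.116292])

# The observed day-to-day changes are also close to 1.
observed_diffs = np.diff(observed)

# Assumed relation: level[t] = observed level[t-1] + predicted change[t].
predicted_levels = observed[:-1] + predicted_diffs

print(observed_diffs)
print(predicted_levels)
```

Under that assumption the reconstructed levels (about 2.0, 3.0 and 4.2) follow the original series, which is the kind of output typ='levels' is meant to return directly.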

Step by step

Dataset

We load the data you provided in your MCVE:

import io
import pandas as pd

raw = io.StringIO("""date        count
2001-11-01   0.998543
2001-11-02   1.914526
2001-11-03   3.057407
2001-11-04   4.044301
2001-11-05   4.952441
2001-11-06   6.002932
2001-11-07   6.930134
2001-11-08   8.011137
2001-11-09   9.040393
2001-11-10  10.097007
2001-11-11  11.063742
2001-11-12  12.051951
2001-11-13  13.062637
2001-11-14  14.086016
2001-11-15  15.096826
2001-11-16  15.944886
2001-11-17  17.027107
2001-11-18  17.930240
2001-11-19  18.984202
2001-11-20  19.971603""")

data = pd.read_fwf(raw, parse_dates=['date'], index_col='date')
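
Before fitting, it can help to confirm that the index really is daily, since the model is later created with freq='D'. The following is a small pandas-only sketch; the three dates are the first rows of the table above, and the variable names are illustrative.

```python
import pandas as pd

# First three dates of the dataset; infer_freq needs at least three.
index = pd.to_datetime(["2001-11-01", "2001-11-02", "2001-11-03"])

# Returns a frequency string when the spacing is regular, else None.
inferred = pd.infer_freq(index)
print(inferred)
```

On the full frame the equivalent check would be pd.infer_freq(data.index).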

As we might expect, the data are autocorrelated:

from pandas.plotting import autocorrelation_plot
autocorrelation_plot(data)
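
The same observation can be checked numerically. This sketch uses Series.autocorr on the values from the question; note that it correlates the overlapping parts of the series and its shifted copy, so its values need not match the curve drawn by autocorrelation_plot.

```python
import pandas as pd

# The twenty observations from the question.
counts = pd.Series([
    0.998543, 1.914526, 3.057407, 4.044301, 4.952441,
    6.002932, 6.930134, 8.011137, 9.040393, 10.097007,
    11.063742, 12.051951, 13.062637, 14.086016, 15.096826,
    15.944886, 17.027107, 17.930240, 18.984202, 19.971603,
])

# Pearson correlation between the series and a copy shifted by `lag` days.
lag1 = counts.autocorr(lag=1)
lag5 = counts.autocorr(lag=5)
print(round(lag1, 4), round(lag5, 4))
```

Values this close to 1 are what a nearly linear trend produces, and they are one reason to difference the series (d=1) before fitting the ARMA part.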

Model & Training

We create an ARIMA model object for a given order (p, d, q) and train it on our data using the fit method:

from statsmodels.tsa.arima_model import ARIMA

order = (2, 1, 2)
model = ARIMA(data, order, freq='D')
fit = model.fit()

The fit method returns an ARIMAResults object, which is what we are interested in. We can check the quality of our model:

fit.summary()

                            ARIMA Model Results                              
==============================================================================
Dep. Variable:                D.count   No. Observations:                   19
Model:                 ARIMA(2, 1, 2)   Log Likelihood                  25.395
Method:                       css-mle   S.D. of innovations              0.059
Date:                Fri, 18 Jan 2019   AIC                            -38.790
Time:                        07:54:36   BIC                            -33.123
Sample:                    11-02-2001   HQIC                           -37.831
                         - 11-20-2001                                         
==============================================================================
                  coef    std err          z      P>|z|      [0.025     0.975]
------------------------------------------------------------------------------
const           1.0001      0.014     73.731      0.000       0.973      1.027
ar.L1.D.count  -0.3971      0.295     -1.346      0.200      -0.975      0.181
ar.L2.D.count  -0.6571      0.230     -2.851      0.013      -1.109     -0.205
ma.L1.D.count   0.0892      0.208      0.429      0.674      -0.318      0.496
ma.L2.D.count   1.0000      0.640      1.563      0.140      -0.254      2.254
                                    Roots                                    
==============================================================================
                   Real          Imaginary           Modulus         Frequency
------------------------------------------------------------------------------
AR.1            -0.3022           -1.1961j            1.2336           -0.2894
AR.2            -0.3022           +1.1961j            1.2336            0.2894
MA.1            -0.0446           -0.9990j            1.0000           -0.2571
MA.2            -0.0446           +0.9990j            1.0000            0.2571
------------------------------------------------------------------------------
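
The Roots table can be reproduced from the coefficient table, which is a useful check on how to read it. The sketch below assumes the usual lag-polynomial convention, 1 - phi1*z - phi2*z^2 for the AR part and 1 + theta1*z + theta2*z^2 for the MA part, and uses the rounded coefficients printed above, so agreement is only to about three decimals.

```python
import numpy as np

# Rounded coefficients copied from the summary table.
ar_l1, ar_l2 = -0.3971, -0.6571
ma_l1, ma_l2 = 0.0892, 1.0000

# np.roots expects coefficients from the highest power down.
ar_roots = np.roots([-ar_l2, -ar_l1, 1.0])  # 1 - ar_l1*z - ar_l2*z^2
ma_roots = np.roots([ma_l2, ma_l1, 1.0])    # 1 + ma_l1*z + ma_l2*z^2

print(ar_roots, np.abs(ar_roots))
print(ma_roots, np.abs(ma_roots))
```

An MA root modulus of 1 sits on the invertibility boundary, which may be a sign that order (2, 1, 2) asks for more structure than 19 differenced observations can support.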

We can also get a rough idea of how the residuals are distributed:

residuals = pd.DataFrame(fit.resid, columns=['residuals'])
residuals.plot(kind='kde')

Prediction

If we are satisfied with our model, we can predict in-sample or out-of-sample data.

This is done with the predict method, which by default returns the differenced endogenous variable, not the endogenous variable itself. To change this behavior, we must specify typ='levels':

preds = fit.predict(1, 30, typ='levels')

Our predictions are then on the same scale as our training data.

If we also want confidence intervals, we can use the forecast method.
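
The integer bounds given to predict are positions in the training index, so it can be handy to translate them to dates. This is a small pandas-only sketch, assuming position 0 corresponds to the first observation (2001-11-01), which is consistent with the question's output starting at 2001-11-02 for start=1.

```python
import pandas as pd

# Daily index starting at the first observation; position 0 is 2001-11-01.
index = pd.date_range("2001-11-01", periods=31, freq="D")

first_pred = index[1]      # start=1
last_observed = index[19]  # last row of the training data
last_pred = index[30]      # end=30

# Positions 20 through 30 lie beyond the training data.
out_of_sample_steps = 30 - 19
print(first_pred.date(), last_observed.date(), last_pred.date(), out_of_sample_steps)
```

Under that assumption, predict(1, 30, ...) covers 19 in-sample days followed by 11 out-of-sample days, ending on 2001-12-01.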

String Argument

The predict method also accepts strings (use the ISO 8601 format to avoid ambiguity) or datetime objects:

preds = fit.predict("2001-11-02", "2001-12-15", typ='levels')

Works as expected on StatsModels 0.9.0:

import statsmodels as sm
sm.__version__ # '0.9.0'
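
To illustrate why ISO 8601 strings are the safer choice, here is a short sketch of how pandas itself reads date strings. It only demonstrates pandas parsing and makes no claim about how statsmodels handles the strings internally.

```python
import pandas as pd

# ISO 8601 (year-month-day) has a single reading.
iso = pd.Timestamp("2001-11-02")

# Slash-separated strings are read month-first by default ...
ambiguous = pd.to_datetime("02/11/2001")
# ... unless day-first parsing is requested explicitly.
day_first = pd.to_datetime("02/11/2001", dayfirst=True)

# Offset in days of the string end date used above from the first observation.
end_position = (pd.Timestamp("2001-12-15") - pd.Timestamp("2001-11-01")).days
print(iso.date(), ambiguous.date(), day_first.date(), end_position)
```

The same text, "02/11/2001", lands on 11 February or 2 November depending on the parsing option, whereas the ISO form cannot be misread.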

answered Sep 27 '22 21:09

jlandercy
