I have a .csv file containing a 5-year time series, with hourly resolution (commoditiy price). Based on the historical data, I want to create a forecast of the prices for the 6th year. I have read a couple of articles on the www about these type of procedures, and I basically based my code on the code posted there, since my knowledge in both Python (especially statsmodels) and statistic is at most limited. Those are the links, for those who are interested: http://www.seanabu.com/2016/03/22/time-series-seasonal-ARIMA-model-in-python/ http://www.johnwittenauer.net/a-simple-time-series-analysis-of-the-sp-500-index/ First of all, here is a sample of the .csv file. Data is displayed with monthly resolution in this case, it is not real data, just randomly choosen numbers to give an example here (in which case I hope one year is enough to be able to develop a forecast for the 2nd year; if not, full csv file is available): <pre class="prettyprint"><code> Price 2011-01-31 32.21 2011-02-28 28.32 2011-03-31 27.12 2011-04-30 29.56 2011-05-31 31.98 2011-06-30 26.25 2011-07-31 24.75 2011-08-31 25.56 2011-09-30 26.68 2011-10-31 29.12 2011-11-30 33.87 2011-12-31 35.45 </code></pre> My current progress is as follows: After reading the input file and setting the date column as datetime index, the follwing script was used to develop a forecast for the available data <pre class="prettyprint"><code>model = sm.tsa.ARIMA(df['Price'].iloc[1:], order=(1, 0, 0)) results = model.fit(disp=-1) df['Forecast'] = results.fittedvalues df[['Price', 'Forecast']].plot(figsize=(16, 12)) </code></pre> ,which gives the following output: <img src="https://i.stack.imgur.com/HRYcT.png" alt="5-year timeseries, hourly resolution data"> Now, as I said, I ain't got no statistic skills and I have little to no idea how I got to this output (basically, changing the order attribute inside the first line changes the output), but the 'actual' forecast looks quite good and I would like to extend it for another year (2016). In order to do that, additional rows are created in the dataframe, as follows: <pre class="prettyprint"><code>start = datetime.datetime.strptime("2016-01-01", "%Y-%m-%d") date_list = pd.date_range('2016-01-01', freq='1D', periods=366) future = pd.DataFrame(index=date_list, columns= df.columns) data = pd.concat([df, future]) </code></pre> Finally, when I use the .predict function of statsmodels: <pre class="prettyprint"><code>data['Forecast'] = results.predict(start = 1825, end = 2192, dynamic= True) data[['Price', 'Forecast']].plot(figsize=(12, 8)) </code></pre> what I get as forecast is a straight line (see below), which doesn't seem at all like a forecast. Moreover, if I extend the range, which now is from the 1825th to 2192nd day (year of 2016), to the whole 6 year timespan, the forecast line is a straight line for the entire period (2011-2016). I have also tried to use the 'statsmodels.tsa.statespace.sarimax.SARIMAX.predict' method, which accounts for a seasonal variation (which makes sense in this case), but I get some error about 'module' has no attribute 'SARIMAX'. But this is secondary problem, will get into more detail if needed. <img src="https://i.stack.imgur.com/ZCfQf.png" alt="forecast output"> Somewhere I am losing grip and I have no idea where. Thanks for reading. Cheers!

It sounds like you are using an older version of statsmodels that does not support SARIMAX. You'll want to install the latest released version 0.8.0 see http://statsmodels.sourceforge.net/devel/install.html. I'm using Anaconda and installed via pip. <pre class="prettyprint"><code>pip install -U statsmodels </code></pre> The results class from the SARIMAX model have a number of useful methods including forecast. <pre class="prettyprint"><code>data['Forecast'] = results.forecast(100) </code></pre> Will use your model to forecast 100 steps into the future.

ARIMA(1,0,0) is a one period autoregressive model. So it's a model that follows this formula: <img src="https://i.stack.imgur.com/ZA1wa.png" alt="enter image description here"> What that means is that the value in time period t is equal to some constant (phi_0) plus a value determined by fitting the ARMA model (phi_1) multiplied by the value in the prior period r_(t-1), plus a white noise error term (a_t). Your model only has a memory of 1 period, so the current prediction is entirely determined by the 1 value of the prior period. It's not a very complex model; it's not doing anything fancy with all the prior values. It's just taking yesterday's price, multiplying it by some value and adding a constant. You should expect it to quickly go to equilibrium and then stay there forever. The reason why the forecast in the top picture looks so good is that it is just showing you hundreds of 1 period forecasts that are starting fresh with each new period. It's not showing a long period prediction like you probably think it is. Looking at the link you sent: http://www.johnwittenauer.net/a-simple-time-series-analysis-of-the-sp-500-index/ read the section where he discusses why this model doesn't give you what you want. "So at first glance it seems like this model is doing pretty well. But although it appears like the forecasts are really close (the lines are almost indistinguishable after all), remember that we used the un-differenced series! The index only fluctuates a small percentage day-to-day relative to the total absolute value. What we really want is to predict the first difference, or the day-to-day moves. We can either re-run the model using the differenced series, or add an "I" term to the ARIMA model (resulting in a (1, 1, 0) model) which should accomplish the same thing. Let's try using the differenced series." To do what you're trying to do, you'll need to do more research into these models and figure out how to format your data, and what model will be appropriate. The most important thing is knowing what information you believe is contained in the data you're feeding into the model. What your model currently is trying to do is say, "Today the price is $45. What will the price be tomorrow?" That's it. It doesn't have any information about momentum, volatility, etc. That's not much to go off.

Forecasting with statsmodels

Tags:

python

statsmodels

forecasting

I have a .csv file containing a 5-year time series, with hourly resolution (commoditiy price). Based on the historical data, I want to create a forecast of the prices for the 6th year.

I have read a couple of articles on the www about these type of procedures, and I basically based my code on the code posted there, since my knowledge in both Python (especially statsmodels) and statistic is at most limited.

Those are the links, for those who are interested:

http://www.seanabu.com/2016/03/22/time-series-seasonal-ARIMA-model-in-python/

http://www.johnwittenauer.net/a-simple-time-series-analysis-of-the-sp-500-index/

First of all, here is a sample of the .csv file. Data is displayed with monthly resolution in this case, it is not real data, just randomly choosen numbers to give an example here (in which case I hope one year is enough to be able to develop a forecast for the 2nd year; if not, full csv file is available):

              Price
2011-01-31    32.21
2011-02-28    28.32
2011-03-31    27.12
2011-04-30    29.56
2011-05-31    31.98
2011-06-30    26.25
2011-07-31    24.75
2011-08-31    25.56
2011-09-30    26.68
2011-10-31    29.12
2011-11-30    33.87
2011-12-31    35.45

My current progress is as follows:

After reading the input file and setting the date column as datetime index, the follwing script was used to develop a forecast for the available data

model = sm.tsa.ARIMA(df['Price'].iloc[1:], order=(1, 0, 0))  
results = model.fit(disp=-1)  
df['Forecast'] = results.fittedvalues  
df[['Price', 'Forecast']].plot(figsize=(16, 12))

,which gives the following output:

5-year timeseries, hourly resolution data

Now, as I said, I ain't got no statistic skills and I have little to no idea how I got to this output (basically, changing the order attribute inside the first line changes the output), but the 'actual' forecast looks quite good and I would like to extend it for another year (2016).

In order to do that, additional rows are created in the dataframe, as follows:

start = datetime.datetime.strptime("2016-01-01", "%Y-%m-%d")
date_list = pd.date_range('2016-01-01', freq='1D', periods=366)
future = pd.DataFrame(index=date_list, columns= df.columns)
data = pd.concat([df, future])

Finally, when I use the .predict function of statsmodels:

data['Forecast'] = results.predict(start = 1825, end = 2192, dynamic= True)  
data[['Price', 'Forecast']].plot(figsize=(12, 8))

what I get as forecast is a straight line (see below), which doesn't seem at all like a forecast. Moreover, if I extend the range, which now is from the 1825th to 2192nd day (year of 2016), to the whole 6 year timespan, the forecast line is a straight line for the entire period (2011-2016).

I have also tried to use the 'statsmodels.tsa.statespace.sarimax.SARIMAX.predict' method, which accounts for a seasonal variation (which makes sense in this case), but I get some error about 'module' has no attribute 'SARIMAX'. But this is secondary problem, will get into more detail if needed.

forecast output

Somewhere I am losing grip and I have no idea where. Thanks for reading. Cheers!

309

asked Aug 02 '16 14:08

davidr

2 Answers

It sounds like you are using an older version of statsmodels that does not support SARIMAX. You'll want to install the latest released version 0.8.0 see http://statsmodels.sourceforge.net/devel/install.html.

I'm using Anaconda and installed via pip.

pip install -U statsmodels

The results class from the SARIMAX model have a number of useful methods including forecast.

data['Forecast'] = results.forecast(100)

Will use your model to forecast 100 steps into the future.

152

answered Oct 16 '22 22:10

Geoff D

ARIMA(1,0,0) is a one period autoregressive model. So it's a model that follows this formula:

enter image description here

What that means is that the value in time period t is equal to some constant (phi_0) plus a value determined by fitting the ARMA model (phi_1) multiplied by the value in the prior period r_(t-1), plus a white noise error term (a_t).

Your model only has a memory of 1 period, so the current prediction is entirely determined by the 1 value of the prior period. It's not a very complex model; it's not doing anything fancy with all the prior values. It's just taking yesterday's price, multiplying it by some value and adding a constant. You should expect it to quickly go to equilibrium and then stay there forever.

The reason why the forecast in the top picture looks so good is that it is just showing you hundreds of 1 period forecasts that are starting fresh with each new period. It's not showing a long period prediction like you probably think it is.

Looking at the link you sent:

http://www.johnwittenauer.net/a-simple-time-series-analysis-of-the-sp-500-index/

read the section where he discusses why this model doesn't give you what you want.

"So at first glance it seems like this model is doing pretty well. But although it appears like the forecasts are really close (the lines are almost indistinguishable after all), remember that we used the un-differenced series! The index only fluctuates a small percentage day-to-day relative to the total absolute value. What we really want is to predict the first difference, or the day-to-day moves. We can either re-run the model using the differenced series, or add an "I" term to the ARIMA model (resulting in a (1, 1, 0) model) which should accomplish the same thing. Let's try using the differenced series."

To do what you're trying to do, you'll need to do more research into these models and figure out how to format your data, and what model will be appropriate. The most important thing is knowing what information you believe is contained in the data you're feeding into the model. What your model currently is trying to do is say, "Today the price is $45. What will the price be tomorrow?" That's it. It doesn't have any information about momentum, volatility, etc. That's not much to go off.

answered Oct 16 '22 20:10

Troy D

Related questions
                            
                                How do I get TensorFlow's 'import_graph_def' to return Tensors
                            
                                element-wise operations of matrix in python
                            
                                How to use pelican to generate a hierarchical website, not a blog
                            
                                error :document must be an instance of dict, bson.son.SON, bson.raw_bson.RawBSONDocument, or a type that inherits from collections.MutableMapping
                            
                                Finding all repeated substrings in a string and how often they appear
                            
                                are there any limitations on the number of locks a python program can create?
                            
                                How to hstack several sparse matrices (feature matrices)?
                            
                                overlapping dates between two date ranges in python
                            
                                using an numpy array as indices of the 2nd dim of another array? [duplicate]
                            
                                How to run async process from handler in aiohttp
                            
                                Barplot/line plot on same plot, but different axis and line plot in front of barplot
                            
                                Plotly python how to draw unbounded lines and spans?
                            
                                Make two Frames occupy 50% of the available width each?
                            
                                Using alembic with multiple databases
                            
                                Dot product of two sparse matrices affecting zero values only
                            
                                use existing field as _id using elasticsearch dsl python DocType
                            
                                Ridge Regression: Scikit-learn vs. direct calculation does not match for alpha > 0
                            
                                Working with variable-length text in Tensorflow
                            
                                Python - Parameter checking with Exception Raising
                            
                                Pandas - Unpack column of lists of varying lengths of tuples

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With