Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how do you create a linear regression forecast on time series data in python

Tags:

python

I need to be able to create a python function for forecasting based on linear regression model with confidence bands on time-series data:

The function needs to take an argument specifying how far out to forecast. For example 1 day, 7 days, 30 days, 90 days etc. Depending on the argument, it will need to create Holt-Winters forcasting with confidence bands:

My time series data looks like this:

print series  [{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]}] 

The function should append the forecasted values to the above time-series data called "series" and return series:

[{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]},{"target": "Forecast", "datapoints": [[186.77999925613403, 1435520801], [178.95000147819519, 1435521131]]},{"target": "Upper", "datapoints": [[186.77999925613403, 1435520801], [178.95000147819519, 1435521131]]},{"target": "Lower", "datapoints": [[186.77999925613403, 1435520801], [178.95000147819519, 1435521131]]}] 

Has anyone done something like this in python? Any ideas how to start?

like image 877
user1471980 Avatar asked Jun 30 '15 20:06

user1471980


People also ask

Can linear regression be used for time series forecasting?

Of course you can use linear regression for time series data. It's just that there are specific tools that only work for time series data that sometimes do a better job.


2 Answers

In the text of your question, you clearly state that you would like upper and lower bounds on your regression output, as well as the output prediction. You also mention using Holt-Winters algorithms for forecasting in particular.

The packages suggested by other answerers are useful, but you might note that sklearn LinearRegression does not give you error bounds "out of the box", statsmodels does not provide Holt-Winters right now.

Therefore, I suggest try using this implementation of Holt-Winters. Unfortunately its license is unclear, so I can't reproduce it here in full. Now, I'm not sure whether you actually want Holt-Winters (seasonal) prediction, or Holt's linear exponential smoothing algorithm. I'm guessing the latter given the title of the post. Thus, you can use the linear() function of the linked library. The technique is described in detail here for interested readers.

In the interests of not providing a link only answer - I'll describe the main features here. A function is defined that takes the data i.e.

 def linear(x, fc, alpha = None, beta = None): 

x is the data to be fit, fc is the number of timesteps that you want to forecast, alpha and beta take their usual Holt-Winters meanings: roughly a parameter to control the amount of smoothing to the "level" and to the "trend" respectively. If alpha or beta are not specified, they are estimated using scipy.optimize.fmin_l_bfgs_b to minimise the RMSE.

The function simply applies the Holt algorithm by looping through the existing data points and then returns the forecast as follows:

 return Y[-fc:], alpha, beta, rmse 

where Y[-fc:] are the forecast points, alpha and beta are the values actually used and rmse is the root mean squared error. Unfortunately, as you can see, there are no upper or lower confidence intervals. By the way - we should probably refer to them as prediction intervals.

Prediction intervals maths

Holt's algorithm and Holt-Winters algorithm are exponential smoothing techniques and finding confidence intervals for predictions generated from them is a tricky subject. They have been referred to as "rule of thumb" methods and, in the case of the Holt-Winters multiplicative algorithm, without "underlying statistical model". However, the final footnote to this page asserts that:

It is possible to calculate confidence intervals around long-term forecasts produced by exponential smoothing models, by considering them as special cases of ARIMA models. (Beware: not all software calculates confidence intervals for these models correctly.) The width of the confidence intervals depends on (i) the RMS error of the model, (ii) the type of smoothing (simple or linear); (iii) the value(s) of the smoothing constant(s); and (iv) the number of periods ahead you are forecasting. In general, the intervals spread out faster as α gets larger in the SES model and they spread out much faster when linear rather than simple smoothing is used.

We see here that an ARIMA(0,2,2) model is equivalent to a Holt linear model with additive errors

Prediction intervals code (i.e. how to proceed)

You indicate in comments that you "can easily do this in R". I guess you may be used to the holt() function provided by the forecast package in R and therefore expecting such intervals. In which case - you can adapt the python library to give them to you on the same basis.

Looking at the R holt code, we can see that it returns an object based on forecast(ets(...). Under the hood - this eventually calls to this function class1, which returns a mean mu and variance var (as well as cj which I have to confess I do not understand). The variance is used to calculate the upper and lower bounds here.

To do something similar in Python - we would need to produce something similar to the class1 R function that estimates the variance of each prediction. This function takes the residuals found in model fitting and multiplies them by a factor at each time step to get the variance at that timestep. In the particular case of the linear Holt's algorithm, the factor is the cumulative sum of alpha + k*beta where k is the number of timesteps' prediction. Once you have that variance at each prediction point, treat the errors as normally distributed and get the X% value from the normal distribution.

Here's an idea how to do this in Python (using the code I linked as your linear function)

#Copy, import or reimplement the RMSE and linear function from #https://gist.github.com/andrequeiroz/5888967  #factor in case there are not 1 timestep per day - in your case #assuming the timesteps are UTC epoch - I think they're 5 min # spaced i.e. 288 per day timesteps_per_day = 288  # Note - big assumption here - your known data will be regular in time # i.e. timesteps_per_day observations per day.  From the timestamps this seems valid. # if you can't guarantee that - you'll need to interpolate the data def holt_predict(data, timestamps, forecast_days, pred_error_level = 0.95):     forecast_timesteps = forecast_days*timesteps_per_day     middle_predictions, alpha, beta, rmse = linear(data,int(forecast_timesteps))     cum_error = [beta+alpha]     for k in range(1,forecast_timesteps):         cum_error.append(cum_error[k-1] + k*beta + alpha)      cum_error = np.array(cum_error)     #Use some numpy multiplication to get the intervals     var = cum_error * rmse**2     # find the correct ppf on the normal distribution (two-sided)     p = abs(scipy.stats.norm.ppf((1-pred_error_level)/2))     interval = np.sqrt(var) * p     upper = middle_predictions + interval     lower = middle_predictions - interval     fcast_timestamps = [timestamps[-1] + i * 86400 / timesteps_per_day for i in range(forecast_timesteps)]      ret_value = []      ret_value.append({'target':'Forecast','datapoints': zip(middle_predictions, fcast_timestamps)})     ret_value.append({'target':'Upper','datapoints':zip(upper,fcast_timestamps)})     ret_value.append({'target':'Lower','datapoints':zip(lower,fcast_timestamps)})     return ret_value  if __name__=='__main__':     import numpy as np     import scipy.stats     from math import sqrt      null = None     data_in = [{"target": "average", "datapoints": [[null, 1435688679],     [34.870499801635745, 1435688694], [null, 1435688709], [null,     1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769],     [null, 1435688784], [null, 1435688799], [null, 1435688814], [null,     1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874],     [null, 1435688889], [null, 1435688904], [null, 1435688919], [null,     1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979],     [38.180000209808348, 1435688994], [null, 1435689009], [null,     1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069],     [null, 1435689084], [null, 1435689099], [null, 1435689114], [null,     1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174],     [null, 1435689189], [null, 1435689204], [null, 1435689219], [null,     1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279],     [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324],     [null, 1435689339], [null, 1435689354], [null, 1435689369], [null,     1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429],     [null, 1435689444], [null, 1435689459], [null, 1435689474], [null,     1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534],     [null, 1435689549], [null, 1435689564]]}]      #translate the data.  There may be better ways if you're     #prepared to use pandas / input data is proper json     time_series = data_in[0]["datapoints"]      epoch_in = []     Y_observed = []      for (y,x) in time_series:         if y and x:             epoch_in.append(x)             Y_observed.append(y)      #Pass in the number of days to forecast     fcast_days = 30     res = holt_predict(Y_observed,epoch_in,fcast_days)     data_out = data_in + res     #data_out now holds the data as you wanted it.      #Optional plot of results     import matplotlib.pyplot as plt     plt.plot(epoch_in,Y_observed)     m,tstamps = zip(*res[0]['datapoints'])     u,tstamps = zip(*res[1]['datapoints'])     l,tstamps = zip(*res[2]['datapoints'])     plt.plot(tstamps,u, label='upper')     plt.plot(tstamps,l, label='lower')     plt.plot(tstamps,m, label='mean')     plt.show() 

N.B. The output I've given adds points as tuple type into your object. If you really need list, then replace zip(upper,fcast_timestamps) with map(list,zip(upper,fcast_timestamps)) where the code adds upper, lower and Forecast dicts to the result.

This code is for the particular case of the Holt's linear algorithm - it is not a generic way to calculate correct prediction intervals.

Important note

Your sample input data seems to have a lot of null and only 3 genuine data points. This is highly unlikely to be a good basis for doing timeseries prediction - especially as they all seem to be with 15 minutes and you're trying to forecast up to 3 months!. Indeed - if you feed that data into the R holt(), it will say:

You've got to be joking. I need more data!

i'm assuming you have a larger dataset to test on. I tried the code above on the stock market opening prices for 2015 and it seemed to give reasonable results (see below).

Code used on 2015 stock market opening prices

You may think the prediction intervals look a little wide. This blog from the author of the R forecast module implies that is intentional, though :

"con­fi­dence inter­vals for the mean are much nar­rower than pre­dic­tion inter­vals"

like image 132
J Richard Snape Avatar answered Oct 03 '22 01:10

J Richard Snape


Scikit is a great module for python

The X and Y vars must be separated into two arrays where if you were to plot them (X,Y) the index from one would match with the other array.

So I guess in your time series data you would separate the X and Y values as follows:

null = None time_series = [{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]}]   # assuming the time series is the X axis value  input_X_vars = [] input_Y_vars = []  for pair in time_series[0]["datapoints"]:     if (pair[0] != None and pair[1] != None):         input_X_vars.append(pair[1])         input_Y_vars.append(pair[0])    import matplotlib.pyplot as plt import numpy as np from sklearn import datasets, linear_model  regr = linear_model.LinearRegression() regr.fit(input_X_vars, input_Y_vars)  test_X_vars = [1435688681, 1435688685]  results = regr.predict(test_X_vars)  forecast_append = {"target": "Lower", "datapoints": results}  time_series.append(forecast_append) 

I set null as None as the 'null' keyword is parsed as simply a variable in python.

like image 34
Parsa Avatar answered Oct 03 '22 03:10

Parsa