I want to run a rolling OLS regression with a window of 1000 observations on my evaluation dataset, which is available at the following URL:
https://drive.google.com/open?id=0B2Iv8dfU4fTUa3dPYW5tejA0bzg
I tried the following Python script with pandas version 0.20.2.
# /usr/bin/python -tt
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.formula.api import ols
df = pd.read_csv('estimated.csv', names=('x','y'))
model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['y']],
                               window_type='rolling', window=1000, intercept=True)
df['Y_hat'] = model.y_predict
However, when I run the script, I get this error: AttributeError: module 'pandas.stats' has no attribute 'ols'. I found out that the error occurs because pandas.stats.ols was removed in pandas version 0.20.0, as the following link shows:
https://github.com/pandas-dev/pandas/pull/11898
How can I run a rolling OLS regression with the latest version of pandas?
Ordinary least squares (OLS) regression is a supervised learning technique. It estimates the unknown parameters of a linear model by minimizing the sum of squared errors between the observed values and the predicted ones.
Normally I would suggest applying something like the OLS class from statsmodels on a rolling basis*, but your dataset is large (length-1000 windows on 258k rows) and you would run into a memory error that way. Instead, you can use the linear algebra approach to calculate the coefficients for each window and then apply those coefficients to that window of your explanatory variable. For more on this, see A Matrix Formulation of the Multiple Regression Model.
* For an implementation based on statsmodels, see a wrapper I created here. An example is here.
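As a minimal sketch of the linear algebra approach mentioned above: for a design matrix X and a response y, the OLS coefficients solve the normal equations, beta = (X'X)^-1 X'y. The data below are synthetic, and the names X, y and true_beta are made up for illustration. np.linalg.solve is used instead of an explicit inverse purely as a numerical-stability choice; the approach itself does not require it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 200 observations, 2 regressors.
X = rng.normal(size=(200, 2))
true_beta = np.array([1.5, -0.5])
y = X @ true_beta + 0.1 * rng.normal(size=200)

# Normal equations: solve (X'X) beta = X'y for beta.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values: apply the coefficients to the explanatory variables.
y_hat = X @ beta
```

In a rolling setting, the same two lines (solve, then multiply) would be applied to each window in turn.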
Note that yhat here is not a single n x 1 vector. It is a stack of n x 1 vectors (with n = 1000 here), i.e. you have one set of predictions per rolling 1000-period block. So the shape of your predictions will be (257526, 1000), as shown below.
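To make that shape concrete, here is a toy-sized sketch (20 observations and a window of 5, both chosen only for illustration). It assumes NumPy 1.20 or later for sliding_window_view. The number of windows is n - window + 1, which is consistent with the figure above: 258525 - 1000 + 1 = 257526.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n, window = 20, 5          # toy sizes, illustrative only
x = np.arange(n, dtype=float)

# One row per window position; each row holds `window` consecutive values.
blocks = sliding_window_view(x, window)

n_windows = n - window + 1
print(blocks.shape)        # (16, 5), i.e. (n_windows, window)
```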
import numpy as np
import pandas as pd

df = pd.read_csv('input/estimated.csv', names=('x','y'))

def rolling_windows(a, window):
    """Creates rolling-window 'blocks' of length `window` from `a`.

    Note that the orientation of rows/columns follows that of pandas.

    Example
    =======
    onedim = np.arange(20)
    twodim = onedim.reshape((5,4))

    print(twodim)
    [[ 0  1  2  3]
     [ 4  5  6  7]
     [ 8  9 10 11]
     [12 13 14 15]
     [16 17 18 19]]

    print(rolling_windows(onedim, 3)[:5])
    [[0 1 2]
     [1 2 3]
     [2 3 4]
     [3 4 5]
     [4 5 6]]

    print(rolling_windows(twodim, 3)[:5])
    [[[ 0  1  2  3]
      [ 4  5  6  7]
      [ 8  9 10 11]]
     [[ 4  5  6  7]
      [ 8  9 10 11]
      [12 13 14 15]]
     [[ 8  9 10 11]
      [12 13 14 15]
      [16 17 18 19]]]
    """
    # Accept pandas objects as well as NumPy arrays.
    if isinstance(a, (pd.Series, pd.DataFrame)):
        a = a.values
    if a.ndim == 1:
        a = a.reshape(-1, 1)
    shape = (a.shape[0] - window + 1, window) + a.shape[1:]
    strides = (a.strides[0],) + a.strides
    # Strided view: no data is copied, so treat the result as read-only.
    windows = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return np.squeeze(windows)

def coefs(y, x):
    # Normal equations: (X'X)^-1 X'y
    return np.dot(np.linalg.inv(np.dot(x.T, x)), np.dot(x.T, y))

rendog = rolling_windows(df.x.values, 1000)
rexog = rolling_windows(df.drop('x', axis=1).values, 1000)

preds = list()
for endog, exog in zip(rendog, rexog):
    # squeeze() drops the column axis when there is a single regressor,
    # so restore a (window, k) design matrix before solving.
    exog = exog.reshape(len(endog), -1)
    pred = np.sum(coefs(endog, exog).T * exog, axis=1)
    preds.append(pred)
preds = np.array(preds)

print(preds.shape)
(257526, 1000)
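An array of shape (257526, 1000) takes roughly 2 GB as float64 (257526 x 1000 x 8 bytes). If one fitted value per row is enough, which is an assumption about the goal and not what the script above computes, then for a single regressor without an intercept the slope in each window reduces to sum(x*y) / sum(x*x), and both sums can be rolled with cumulative sums. The sketch below uses synthetic data and hypothetical names (rolling_sum, slope, fitted).

```python
import numpy as np

def rolling_sum(a, window):
    """Sum of each length-`window` block of the 1-D array `a`."""
    c = np.concatenate(([0.0], np.cumsum(a)))
    return c[window:] - c[:-window]

rng = np.random.default_rng(1)
n, window = 500, 50                      # toy sizes, illustrative only
exog = rng.normal(size=n)
endog = 2.0 * exog + 0.1 * rng.normal(size=n)

# No-intercept, single-regressor OLS slope for every window.
slope = rolling_sum(exog * endog, window) / rolling_sum(exog * exog, window)

# Fitted value for the last observation of each window.
fitted = slope * exog[window - 1:]

# Spot-check one window against the matrix formula.
i = 123
xw = exog[i:i + window].reshape(-1, 1)
yw = endog[i:i + window]
beta_i = np.linalg.solve(xw.T @ xw, xw.T @ yw)[0]
```

This keeps memory proportional to the number of rows, not rows times window. On very long series, cumulative sums can accumulate floating-point error, so the result is worth spot-checking against the matrix formula as done for window i here.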
Lastly: have you considered using a Random Forest Classifier here, given that your y variable is discrete?