
How to do OLS Regression with the latest version of Pandas

I want to run a rolling OLS regression with a window of 1000 observations on my evaluation dataset, which can be found at the following URL:


I tried using the following Python script with pandas version 0.20.2.

# /usr/bin/python -tt

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.formula.api import ols

df = pd.read_csv('estimated.csv', names=('x','y'))

model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['y']], 
                               window_type='rolling', window=1000, intercept=True)
df['Y_hat'] = model.y_predict

However, when I run the script I get this error: AttributeError: module 'pandas.stats' has no attribute 'ols'. I found that this happens because the module was removed in pandas 0.20.0, as the following link shows.


How can we do OLS Regression with the latest version of Pandas?

Desta Haileselassie Hagos asked Jun 22 '17 21:06



1 Answer

While I would normally suggest applying something like the statsmodels OLS estimator on a rolling basis*, your dataset is large (length-1000 windows on 258k rows) and you will run into a memory error that way. Instead, you could use the linear algebra approach to calculate the coefficients and then apply these coefficients to each window of your explanatory variable. For more on this, see A Matrix Formulation of the Multiple Regression Model.

* To see an implementation of statsmodels, see a wrapper I created here. An example is here.
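To make the linear algebra approach concrete for a single window, here is a minimal sketch that solves the normal equations (X'X)b = X'y. The data is synthetic and the name ols_coefs is only illustrative; it is not taken from the dataset in the question. It uses np.linalg.solve rather than an explicit inverse, which is a common choice for numerical stability.

```python
import numpy as np

def ols_coefs(y, X):
    """Solve the normal equations (X'X) b = X'y for b."""
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data (an assumption for illustration, not the asker's file).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=200)

# A column of ones gives the regression an intercept term.
X = np.column_stack([np.ones_like(x), x])
beta = ols_coefs(y, X)   # beta[0] is the intercept, beta[1] the slope
y_hat = X @ beta         # fitted values for this one window
```

The same function can be called once per rolling window; the fitted values for each window then have the same length as the window.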

Realize that yhat here is not an nx1 vector--it is a bunch of nx1 vectors stacked on top of each other, i.e. you have 1 set of predictions per rolling 1000-period block. So the shape of your predictions will be (257526, 1000), as shown below.
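The shape follows from the window count: n observations and a window of length w give n - w + 1 windows, so 257526 windows of length 1000 is consistent with roughly 258k rows. A small sketch of that arithmetic, using NumPy's sliding_window_view (available in NumPy 1.20 and later) on a toy array rather than the real data:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

a = np.arange(10)   # toy input standing in for one column of the data
window = 4

# Read-only view with one row per window; no data is copied.
blocks = sliding_window_view(a, window)

# Number of complete windows that fit in the array.
n_windows = len(a) - window + 1

print(blocks.shape)
```

Predictions stacked per window have this same (n - w + 1, w) layout.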

import numpy as np
import pandas as pd

df = pd.read_csv('input/estimated.csv', names=('x','y'))

def rolling_windows(a, window):
    """Creates rolling-window 'blocks' of length `window` from `a`.

    Note that the orientation of rows/columns follows that of pandas.

    onedim = np.arange(20)
    twodim = onedim.reshape((5,4))

    [[ 0  1  2  3]
     [ 4  5  6  7]
     [ 8  9 10 11]
     [12 13 14 15]
     [16 17 18 19]]

    print(rolling_windows(onedim, 3)[:5])
    [[0 1 2]
     [1 2 3]
     [2 3 4]
     [3 4 5]
     [4 5 6]]

    print(rolling_windows(twodim, 3)[:5])
    [[[ 0  1  2  3]
      [ 4  5  6  7]
      [ 8  9 10 11]]

     [[ 4  5  6  7]
      [ 8  9 10 11]
      [12 13 14 15]]

     [[ 8  9 10 11]
      [12 13 14 15]
      [16 17 18 19]]]
    """
    if isinstance(a, (pd.Series, pd.DataFrame)):
        a = a.values
    if a.ndim == 1:
        a = a.reshape(-1, 1)
    shape = (a.shape[0] - window + 1, window) + a.shape[1:]
    strides = (a.strides[0],) + a.strides
    windows = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return np.squeeze(windows)

def coefs(y, x):
    return np.dot(np.linalg.inv(np.dot(x.T, x)), np.dot(x.T, y))

rendog = rolling_windows(df.x.values, 1000)
rexog = rolling_windows(df.drop('x', axis=1).values, 1000)

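preds = list()
for endog, exog in zip(rendog, rexog):
    # np.squeeze leaves a single regressor 1-D; restore the column axis
    exog = exog.reshape(len(exog), -1)
    # no constant column is added, so this fit has no intercept
    pred = np.sum(coefs(endog, exog).T * exog, axis=1)
    preds.append(pred)
preds = np.array(preds)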

print(preds.shape)
(257526, 1000)
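If you only need one fitted value per row (from the window ending at that row) rather than a full block per window, a lighter alternative for a single regressor is to use pandas' own rolling statistics: slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). This is a different technique from the loop above, it includes an intercept, and the function name rolling_simple_ols and the synthetic data are only assumptions for illustration.

```python
import numpy as np
import pandas as pd

def rolling_simple_ols(y, x, window):
    """Rolling slope and intercept of y on a single regressor x."""
    roll_x = x.rolling(window)
    # cov and var both default to ddof=1, so the ratio is the OLS slope.
    slope = roll_x.cov(y) / roll_x.var()
    intercept = y.rolling(window).mean() - slope * roll_x.mean()
    return pd.DataFrame({"intercept": intercept, "slope": slope})

# Synthetic data (an assumption for illustration, not the asker's file).
rng = np.random.default_rng(1)
demo = pd.DataFrame({"x": rng.normal(size=500)})
demo["y"] = 1.0 + 0.5 * demo["x"] + rng.normal(scale=0.2, size=500)

params = rolling_simple_ols(demo["y"], demo["x"], window=100)

# Fitted value at each row, using the window that ends at that row.
demo["y_hat"] = params["intercept"] + params["slope"] * demo["x"]
```

The first window - 1 rows are NaN because no complete window is available yet. Memory use stays proportional to the number of rows, since no (windows x window) array is built.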

Lastly: have you considered using a Random Forest Classifier here, given that your y variable is discrete?

Brad Solomon answered Oct 18 '22 19:10
