 

Python: Fastest way to perform millions of simple linear regressions with 1 exogenous variable only

I am performing component-wise regression on time series data. Instead of regressing y against x1, x2, ..., xN all at once, we regress y against x1 only, y against x2 only, ..., and take the regression that reduces the sum of squared residuals the most as the next base learner. This is repeated M times, so the final model is the sum of many simple linear regressions of the form y against xi (1 exogenous variable only); basically gradient boosting using linear regression as the base learners.
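
Roughly, the loop I have in mind looks like this (a simplified sketch for illustration only, not my actual code; the function name and learning_rate are placeholders, and the base learners here have no constant):

    import numpy as np

    def componentwise_boost(y, X, n_rounds, learning_rate=1.0):
        """Sketch of component-wise boosting with one-variable OLS base
        learners (no constant), fitting the current residuals each round."""
        n, p = X.shape
        fitted = np.zeros(n)
        residual = y.copy()
        learners = []  # (column index, scaled slope) pairs
        for _ in range(n_rounds):
            best = None
            for j in range(p):
                xj = X[:, j]
                beta = xj @ residual / (xj @ xj)       # closed-form no-constant slope
                ssr = ((residual - beta * xj) ** 2).sum()
                if best is None or ssr < best[2]:
                    best = (j, beta, ssr)
            j, beta, _ = best
            fitted += learning_rate * beta * X[:, j]   # add the best base learner
            residual = y - fitted
            learners.append((j, learning_rate * beta))
        return learners, fitted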

The problem is that since I am performing a rolling-window regression on the time series data, I have to do N × M × T regressions, which is more than a million OLS fits. Though each individual OLS is very fast, the whole thing takes a few hours to run on my weak laptop.

Currently, I am using statsmodels' OLS(...).fit() to get the parameters for each y against xi regression, as follows. z_matrix is the data matrix, and i is the column being sliced out for the regression. There are about 100 rows, and z_matrix is roughly 100 × 500.

    ols_model = sm.OLS(endog=endog, exog=self.z_matrix[:, i][..., None]).fit()
    return ols_model.params, ols_model.ssr, ols_model.fittedvalues[..., None]

I have read in a previous post from 2016, Fastest way to calculate many regressions in python?, that repeated calls to statsmodels are not efficient, so I tried one of the answers, which suggested numpy's pinv. Unfortunately it is slower:

    # slower: 40 sec vs 30 sec with statsmodels for 100 repeated runs of 150 linear regressions
    params = np.linalg.pinv(self.z_matrix[:, [i]]).dot(endog)
    y_hat = self.z_matrix[:, [i]]@params
    ssr = sum((y_hat-endog)**2)
    return params, ssr, y_hat

Does anyone have any better suggestions to speed up the computation? I just need the estimated parameters, the sum of squared residuals, and the predicted ŷ values. Thank you!

asked Jun 26 '20 by Lim Kaizhuo



1 Answer

Here is one way, since you are always running regressions without a constant. This code runs around 900K models in about 0.5 s and retains the estimated parameters, the predicted values, and the sse for each of the 900K regressions.

The big idea is to exploit the math behind regressing one variable on another: when the model does not contain a constant, the slope is just the ratio of a cross-product to an inner product. This could be modified to also include a constant by using a moving-window demean to estimate the intercept (a rough sketch of that appears after the code below).
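
In symbols, for a window W of observations, the no-constant slope for column i is

$$\hat\beta_i \;=\; \frac{\sum_{t \in W} x_{t,i}\, y_t}{\sum_{t \in W} x_{t,i}^{2}},$$

and the numerator and denominator for every window can be obtained from differences of cumulative sums, which is exactly what the code below does.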

import numpy as np
from statsmodels.regression.linear_model import OLS
import datetime

gen = np.random.default_rng(20210514)

# Number of observations
n = 1000
# Number of predictors
m = 1000
# Window size
w = 100
# Simulate data
y = gen.standard_normal((n, 1))
x = gen.standard_normal((n, m))

now = datetime.datetime.now()
# Compute rolling covariance and variance-like terms
# These assume the model is y = x*b + e w/o a constant
c = np.r_[np.zeros((1, m)), np.cumsum(x * y, axis=0)]
v = np.r_[np.zeros((1, m)), np.cumsum(x * x, axis=0)]
c_trimmed = c[w:] - c[:-w]
v_trimmed = v[w:] - v[:-w]
# Parameters are just the ratio
params = c_trimmed / v_trimmed

# Build a selector array to quickly reshape y and the columns of x
step = np.arange(n - w + 1)
sel = np.arange(w)
locs = step[:, None] + sel
# Get the blocked reshape of y. It has n - w + 1 rows with window observations
# and looks like
# [[y[0],y[1],...,y[99]],
#  [y[1],y[2],...,y[100]],
#  ...,
#  [y[900],y[901],...,y[999]]]
y_block = y[locs, 0]
# Storage for the predicted values and the sse
y_pred = np.empty((x.shape[1],) + y_block.shape)
sse = np.empty((n - w + 1, m))
# Easiest to loop over columns.
# Could do broadcasting tricks, but not worth the trouble since the number of columns is modest
for i in range(x.shape[1]):
    # Reshape a column of x like y
    x_block = x[locs, i]
    # Get the parameters and make sure it is 2d with shape (m-w+1, 1)
    # so the broadcasting works
    p = params[:, i][:, None]
    # Get the predicted value
    y_pred[i] = x_block * p
    # And the sse
    sse[:, i] = ((y_block - y_pred[i]) ** 2).sum(1)
print(f"Time: {(datetime.datetime.now() - now).total_seconds()}s")
# Some test code
# Test any single observation
start = 124
assert start <= n - w
column = 342
assert column < x.shape[1]
res = OLS(y[start : start + 100], x[start : start + 100, [column]]).fit()
np.testing.assert_allclose(res.params[0], params[start, column])
np.testing.assert_allclose(res.fittedvalues, y_pred[column, start])
np.testing.assert_allclose(res.ssr, sse[start, column])
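
As a rough sketch of the moving-window demeaning mentioned above (not part of the timing; rolling_sum, the data setup, and the other names are illustrative, and this snippet is an assumption about how one might add the constant, not code from the timed answer), the with-constant slope and intercept could be computed with the same cumulative-sum trick:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.regression.linear_model import OLS

    gen = np.random.default_rng(20210514)
    n, m, w = 1000, 1000, 100
    y = gen.standard_normal((n, 1))
    x = gen.standard_normal((n, m))

    def rolling_sum(a):
        # Rolling-window sums via differences of cumulative sums, as above
        s = np.r_[np.zeros((1, a.shape[1])), np.cumsum(a, axis=0)]
        return s[w:] - s[:-w]

    x_bar = rolling_sum(x) / w                       # rolling mean of each column, (n - w + 1, m)
    y_bar = rolling_sum(y) / w                       # rolling mean of y, (n - w + 1, 1), broadcasts
    s_xy = rolling_sum(x * y) - w * x_bar * y_bar    # rolling sum of (x - x_bar) * (y - y_bar)
    s_xx = rolling_sum(x * x) - w * x_bar ** 2       # rolling sum of (x - x_bar) ** 2
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    # Spot check one window/column against statsmodels with a constant
    start, column = 124, 342
    res = OLS(y[start : start + w], sm.add_constant(x[start : start + w, [column]])).fit()
    np.testing.assert_allclose(res.params, [intercept[start, column], slope[start, column]])

The slope and intercept arrays have the same (n - w + 1, m) layout as params above, so the same blocked loop works for the predicted values and sse, with the prediction for each column being intercept + slope * x_block.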
answered Sep 29 '22 by Kevin S