 

Python: Fastest way to perform millions of simple linear regressions with 1 exogenous variable only

I am performing component-wise regression on time series data. Instead of regressing y against x1, x2, ..., xN all at once, we regress y against x1 only, y against x2 only, ..., and take the regression that reduces the sum of squared residuals the most as the next base learner. This is repeated M times, so the final model is the sum of many simple linear regressions of the form y against xi (1 exogenous variable only); basically gradient boosting using linear regression as the base learners.
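
Roughly, the loop I have in mind looks like this (a simplified sketch for illustration only, not my actual code; the function name and learning_rate are placeholders, and the base learners here have no constant):

    import numpy as np

    def componentwise_boost(y, X, n_rounds, learning_rate=1.0):
        """Sketch of component-wise boosting with one-variable OLS base
        learners (no constant), fitting the current residuals each round."""
        n, p = X.shape
        fitted = np.zeros(n)
        residual = y.copy()
        learners = []  # (column index, scaled slope) pairs
        for _ in range(n_rounds):
            best = None
            for j in range(p):
                xj = X[:, j]
                beta = xj @ residual / (xj @ xj)       # closed-form no-constant slope
                ssr = ((residual - beta * xj) ** 2).sum()
                if best is None or ssr < best[2]:
                    best = (j, beta, ssr)
            j, beta, _ = best
            fitted += learning_rate * beta * X[:, j]   # add the best base learner
            residual = y - fitted
            learners.append((j, learning_rate * beta))
        return learners, fitted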

The problem is that since I am performing a rolling-window regression on the time series data, I have to do N × M × T regressions, which is more than a million OLS fits. Though each individual OLS is very fast, the whole thing takes a few hours to run on my weak laptop.

Currently, I am using statsmodels' OLS(...).fit() to get the parameters for each y against xi regression, as follows. z_matrix is the data matrix, and i is the column being sliced out for the regression. There are about 100 rows, and z_matrix is roughly 100 × 500.

    ols_model = sm.OLS(endog=endog, exog=self.z_matrix[:, i][..., None]).fit()
    return ols_model.params, ols_model.ssr, ols_model.fittedvalues[..., None]

I have read in a previous post from 2016, Fastest way to calculate many regressions in python?, that repeated calls to statsmodels are not efficient, so I tried one of the answers, which suggested numpy's pinv. Unfortunately it is slower:

    # slower: 40 sec vs 30 sec with statsmodels for 100 repeated runs of 150 linear regressions
    params = np.linalg.pinv(self.z_matrix[:, [i]]).dot(endog)
    y_hat = self.z_matrix[:, [i]]@params
    ssr = sum((y_hat-endog)**2)
    return params, ssr, y_hat

Does anyone have any better suggestions to speed up the computation? I just need the estimated parameters, the sum of squared residuals, and the predicted ŷ values. Thank you!

asked Jun 26 '20 by Lim Kaizhuo



1 Answer

Here is one way, since you are always running regressions without a constant. This code runs around 900K models in about 0.5 s and retains the estimated parameters, the predicted values, and the sse for each of the 900K regressions.

The big idea is to exploit the math behind regressing one variable on another: when the model does not contain a constant, the slope is just the ratio of a cross-product to an inner product. This could be modified to also include a constant by using a moving-window demean to estimate the intercept (a rough sketch of that appears after the code below).
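
In symbols, for a window W of observations, the no-constant slope for column i is

$$\hat\beta_i \;=\; \frac{\sum_{t \in W} x_{t,i}\, y_t}{\sum_{t \in W} x_{t,i}^{2}},$$

and the numerator and denominator for every window can be obtained from differences of cumulative sums, which is exactly what the code below does.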

import numpy as np
from statsmodels.regression.linear_model import OLS
import datetime

gen = np.random.default_rng(20210514)

# Number of observations
n = 1000
# Number of predictors
m = 1000
# Window size
w = 100
# Simulate data
y = gen.standard_normal((n, 1))
x = gen.standard_normal((n, m))

now = datetime.datetime.now()
# Compute rolling covariance and variance-like terms
# These assume the model is y = x*b + e w/o a constant
c = np.r_[np.zeros((1, m)), np.cumsum(x * y, axis=0)]
v = np.r_[np.zeros((1, m)), np.cumsum(x * x, axis=0)]
c_trimmed = c[w:] - c[:-w]
v_trimmed = v[w:] - v[:-w]
# Parameters are just the ratio
params = c_trimmed / v_trimmed

# Build a selector array to quickly reshape y and the columns of x
step = np.arange(n - w + 1)
sel = np.arange(w)
locs = step[:, None] + sel
# Get the blocked reshape of y. It has n - w + 1 rows with window observations
# and looks like
# [[y[0],y[1],...,y[99]],
#  [y[1],y[2],...,y[100]],
#  ...,
#  [y[900],y[901],...,y[999]]]
y_block = y[locs, 0]
# Storage for the predicted values and the sse
y_pred = np.empty((x.shape[1],) + y_block.shape)
sse = np.empty((n - w + 1, m))
# Easiest to loop over columns.
# Could do broadcasting tricks, but not worth the trouble since the number of columns is modest
for i in range(x.shape[1]):
    # Reshape a column of x like y
    x_block = x[locs, i]
    # Get the parameters and make sure it is 2d with shape (m-w+1, 1)
    # so the broadcasting works
    p = params[:, i][:, None]
    # Get the predicted value
    y_pred[i] = x_block * p
    # And the sse
    sse[:, i] = ((y_block - y_pred[i]) ** 2).sum(1)
print(f"Time: {(datetime.datetime.now() - now).total_seconds()}s")
# Some test code
# Test any single observation
start = 124
assert start <= n - w
column = 342
assert column < x.shape[1]
res = OLS(y[start : start + 100], x[start : start + 100, [column]]).fit()
np.testing.assert_allclose(res.params[0], params[start, column])
np.testing.assert_allclose(res.fittedvalues, y_pred[column, start])
np.testing.assert_allclose(res.ssr, sse[start, column])
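
As a rough sketch of the moving-window demeaning mentioned above (not part of the timing; rolling_sum, the data setup, and the other names are illustrative, and this snippet is an assumption about how one might add the constant, not code from the timed answer), the with-constant slope and intercept could be computed with the same cumulative-sum trick:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.regression.linear_model import OLS

    gen = np.random.default_rng(20210514)
    n, m, w = 1000, 1000, 100
    y = gen.standard_normal((n, 1))
    x = gen.standard_normal((n, m))

    def rolling_sum(a):
        # Rolling-window sums via differences of cumulative sums, as above
        s = np.r_[np.zeros((1, a.shape[1])), np.cumsum(a, axis=0)]
        return s[w:] - s[:-w]

    x_bar = rolling_sum(x) / w                       # rolling mean of each column, (n - w + 1, m)
    y_bar = rolling_sum(y) / w                       # rolling mean of y, (n - w + 1, 1), broadcasts
    s_xy = rolling_sum(x * y) - w * x_bar * y_bar    # rolling sum of (x - x_bar) * (y - y_bar)
    s_xx = rolling_sum(x * x) - w * x_bar ** 2       # rolling sum of (x - x_bar) ** 2
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    # Spot check one window/column against statsmodels with a constant
    start, column = 124, 342
    res = OLS(y[start : start + w], sm.add_constant(x[start : start + w, [column]])).fit()
    np.testing.assert_allclose(res.params, [intercept[start, column], slope[start, column]])

The slope and intercept arrays have the same (n - w + 1, m) layout as params above, so the same blocked loop works for the predicted values and sse, with the prediction for each column being intercept + slope * x_block.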
answered Sep 29 '22 by Kevin S