Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Multiple Simple Linear Regression

Note this is not a question about multiple regression, it is a question about doing simple (single-variable) regression multiple times in Python/NumPy (2.7).

I have two m x n arrays x and y. The rows correspond to each other, and each pair is the set of (x,y) points for a measurement. That is, plt.plot(x.T, y.T, '.') would plot each of m datasets/measurements.

I'm wondering what the best way to perform the m linear regressions is. Currently I loop over the rows and use scipy.stats.linregress(). (Assume I don't want solutions based on doing linear algebra with the matrices but instead want to work with this function, or an equivalent black-box function.) I could try np.vectorize, but the docs indicate it also loops.

With some experimenting, I've also found a way to use list comprehensions with map() and get correct results. I've put both solutions below. In IPython, `%%timeit`` returns, using a small dataset (commented out):

(loop) 1000 loops, best of 3: 642 µs per loop
(map) 1000 loops, best of 3: 634 µs per loop

To try magnifying this, I made a much bigger random dataset (dimension trials x trials):

(loop, trials = 1000)  1 loops, best of 3: 299 ms per loop
(loop, trials = 10000) 1 loops, best of 3: 5.64 s per loop
(map, trials = 1000)   1 loops, best of 3: 256 ms per loop
(map, trials = 10000)  1 loops, best of 3: 2.37 s per loop

That's a decent speedup on a really big set, but I was expecting a bit more. Is there a better way?

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
np.random.seed(42)
#y = np.array(((0,1,2,3),(1,2,3,4),(2,4,6,8)))
#x = np.tile(np.arange(4), (3,1))
trials = 1000
y = np.random.rand(trials,trials)
x = np.tile(np.arange(trials), (trials,1))
num_rows = shape(y)[0]
slope = np.zeros(num_rows)
inter = np.zeros(num_rows)
for k, xrow in enumerate(x):
    yrow = y[k,:]
    slope[k], inter[k], t1, t2, t3 = stats.linregress(xrow, yrow)
#plt.plot(x.T, y.T, '.')
#plt.hold = True
#plt.plot(x.T, x.T*slope + intercept)
# Can the loop be removed?
tempx = [x[k,:] for k in range(num_rows)]
tempy = [y[k,:] for k in range(num_rows)]
results = np.array(map(stats.linregress, tempx, tempy))
slope_vec = results[:,0]
inter_vec = results[:,1]
#plt.plot(x.T, y.T, '.')
#plt.hold = True
#plt.plot(x.T, x.T*slope_vec + inter_vec)
print "Slopes equal by both methods?: ", np.allclose(slope, slope_vec)
print "Inters equal by both methods?: ", np.allclose(inter, inter_vec)
like image 776
schodge Avatar asked Mar 25 '14 21:03

schodge


People also ask

Can we plot multiple linear regression in Python?

Plot the ResultsUse xlabel to label the x-axis and use ylabel to label the y-axis. Regression plot of our model. A regression plot is useful to understand the linear relationship between two parameters. It creates a regression line in-between those parameters and then plots a scatter plot of those data points.

What is simple multiple linear regression?

Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. Multiple regression is an extension of linear (OLS) regression that uses just one explanatory variable.

What is difference between simple and multiple linear regression?

What is difference between simple linear and multiple linear regressions? Simple linear regression has only one x and one y variable. Multiple linear regression has one y and two or more x variables. For instance, when we predict rent based on square feet alone that is simple linear regression.


1 Answers

Single variable linear regression is simple enough to vectorize it manually:

def multiple_linregress(x, y):
    x_mean = np.mean(x, axis=1, keepdims=True)
    x_norm = x - x_mean
    y_mean = np.mean(y, axis=1, keepdims=True)
    y_norm = y - y_mean

    slope = (np.einsum('ij,ij->i', x_norm, y_norm) /
             np.einsum('ij,ij->i', x_norm, x_norm))
    intercept = y_mean[:, 0] - slope * x_mean[:, 0]

    return np.column_stack((slope, intercept))

With some made up data:

m = 1000
n = 1000
x = np.random.rand(m, n)
y = np.random.rand(m, n)

it outperforms your looping options by a fair margin:

%timeit multiple_linregress(x, y)
100 loops, best of 3: 14.1 ms per loop
like image 109
Jaime Avatar answered Sep 18 '22 07:09

Jaime