
Efficient 1D linear regression for each element of 3D numpy array

I have 3D stacks of masked arrays. I'd like to perform a linear regression for the values at each row,col (spatial index) along axis 0 (time). The dimensions of these stacks vary, but a typical shape might be (50, 2000, 2000). My spatially limited but temporally dense test case has these dimensions:

stack.ma_stack.shape

(1461, 390, 327)

I did a quick test looping over each row,col:

import numpy as np
from scipy.stats.mstats import linregress

#Ordinal dates (independent variable)
x = stack.date_list_o
#Note: idx should be (row, col)
def sample_lstsq(idx):
    b = stack.ma_stack[:, idx[0], idx[1]]
    #Note: this is the masked-stats version of linregress
    slope, intercept, r_value, p_value, std_err = linregress(x, b)
    return slope

out = np.zeros_like(stack.ma_stack[0])
for row in np.arange(stack.ma_stack.shape[1]):
    for col in np.arange(stack.ma_stack.shape[2]):
        out[row, col] = sample_lstsq((row, col))

This works (slowly). I know there must be a more efficient approach.

I started playing with index arrays and np.vectorize, but I don't think that's actually going to offer any real improvement. I've considered dumping everything to Pandas or attempting to port to Cython, but I'm hoping I can stick with NumPy/SciPy. Or maybe a parallel solution is my best option for improving performance?
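For what it's worth, a np.vectorize version would look something like the following sketch (reusing sample_lstsq and stack from above); since np.vectorize is essentially a Python-level loop, I wouldn't expect it to beat the explicit double loop:

import numpy as np

#np.vectorize is a convenience wrapper around a Python loop,
#so this should perform about the same as the nested loops above
vec_lstsq = np.vectorize(lambda r, c: sample_lstsq((r, c)))
rows, cols = np.meshgrid(np.arange(stack.ma_stack.shape[1]),
                         np.arange(stack.ma_stack.shape[2]),
                         indexing='ij')
out = vec_lstsq(rows, cols)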

Also, has anybody benchmarked the NumPy/SciPy linear regression options? I've come across the following options but haven't tested them myself (a minimal timing harness is sketched after this list):

  • scipy.stats.linregress
  • numpy.linalg.lstsq
  • numpy.polyfit(deg=1)
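For reference, a rough single-column timing harness for these three options might look like this (array size and repetition count are arbitrary, not from my data):

import timeit
import numpy as np
from scipy.stats import linregress

x = np.random.rand(1000)
y = 2.0 * x + 1.0 + np.random.normal(scale=0.1, size=x.size)
A = np.c_[x, np.ones_like(x)]

#Time 100 repetitions of each single-column fit
t_linregress = timeit.timeit(lambda: linregress(x, y), number=100)
t_lstsq = timeit.timeit(lambda: np.linalg.lstsq(A, y), number=100)
t_polyfit = timeit.timeit(lambda: np.polyfit(x, y, 1), number=100)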

I'm hoping there is an existing approach that offers a significant performance boost without much work for implementation. Thanks.


Edited 12/3/13 @02:29

The approach suggested by @HYRY works perfectly (~15 s runtime) for the sample dataset described above, which is continuous (unmasked) in all dimensions (space and time). However, when passing a masked array containing missing data to np.linalg.lstsq, all masked values are filled with the fill_value (default 1E20), which leads to bogus linear fits.
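A minimal illustration of the pitfall (not my data, just the default masked-array fill behavior):

import numpy as np

y = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
#Plain (non-ma) routines see the raw buffer, so the masked entry
#becomes the fill value instead of being excluded from the fit
print y.filled()    #[1e0, 1e20, 3e0] with the default fill_value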

Fortunately, the numpy masked array module has np.ma.polyfit(deg=1), which can handle a 2D y array like np.linalg.lstsq. Looking at the source (https://github.com/numpy/numpy/blob/v1.8.0/numpy/ma/extras.py#L1852), the ma polyfit is just a wrapper around np.polyfit that applies a combined mask derived from the x and y masks. This works nicely for a 2D y when the locations of the missing data in y are constant across columns.
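For example, a minimal sketch (not my real data) where the mask is identical in every column of y:

import numpy as np

x = np.ma.masked_array(np.arange(10, dtype=float))
y = np.ma.masked_array(np.random.rand(10, 5))
#Mask an entire time slice, so the mask is constant across columns
y[3] = np.ma.masked

#np.ma.polyfit drops the masked rows for every column at once;
#p has shape (2, 5): slopes in row 0, intercepts in row 1
p = np.ma.polyfit(x, y, deg=1)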

Unfortunately, my data have variable missing data locations in space and time. Here's an example from another stack:

In [146]: stack.ma_stack.shape
Out [146]: (57, 1889, 1566)

Sampling a single index returns a timeseries with 6 unmasked values:

In [147]: stack.ma_stack[:,0,0]
Out [147]: 
masked_array(data = [-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 519.7779541015625 -- -- -- 518.9047241210938 -- -- -- -- -- -- --
 516.6539306640625 516.0836181640625 515.9403686523438 -- -- -- --
 514.85205078125 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --],
             mask = [ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True False  True  True
  True False  True  True  True  True  True  True  True False False False
  True  True  True  True False  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True],
       fill_value = 1e+20)

Sampling a different location returns a different number of unmasked values from different time slices:

In [148]: stack.ma_stack[:,1888,1565]
Out[148]:
masked_array(data = [-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 729.0936889648438 -- -- -- 724.7155151367188 -- -- -- -- -- -- --
 722.076171875 720.9276733398438 721.9603881835938 -- 720.3294067382812 --
 -- 713.9591064453125 709.8037719726562 707.756103515625 -- -- --
 703.662353515625 -- -- -- -- 708.6276245117188 -- -- -- -- --],
             mask = [ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True False  True  True
  True False  True  True  True  True  True  True  True False False False
  True False  True  True False False False  True  True  True False  True
  True  True  True False  True  True  True  True  True],
       fill_value = 1e+20)

Across all indices, the minimum number of unmasked values is 6 and the maximum is 45, so every location has at least some masked values.
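These per-pixel counts come straight from the mask:

#Number of unmasked values at each (row, col)
counts = stack.ma_stack.count(axis=0)
print counts.min(), counts.max()
#6 45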

For reference, my x (time ordinals) values are all unmasked:

In [150]: stack.date_list_o
Out[150]:
masked_array(data = [ 733197.64375     733962.64861111  733964.65694444  733996.62361111
  733999.64236111  734001.63541667  734033.64305556  734071.64722222
  734214.675       734215.65694444  734216.625       734226.64722222
  734229.63819444  734232.65694444  734233.67847222  734238.63055556
  734238.63055556  734245.65277778  734245.65277778  734255.63125
  734255.63125     734307.85        734326.65138889  734348.63888889
  734348.63958333  734351.85        734363.70763889  734364.65486111
  734390.64722222  734391.63194444  734394.65138889  734407.64652778
  734407.64722222  734494.85        734527.85        734582.85
  734602.65486111  734664.85555556  734692.64027778  734741.63541667
  734747.85        734807.85555556  734884.85555556  734911.65763889
  734913.64375     734917.64236111  734928.85555556  734944.71388889
  734961.62777778  735016.04583333  735016.62777778  735016.85555556
  735036.65347222  735054.04583333  735102.63125     735119.61180556
  735140.63263889],
             mask = False,
       fill_value = 1e+20)

So I reshape stack.ma_stack and run polyfit:

newshape = (stack.ma_stack.shape[0], stack.ma_stack.shape[1]*stack.ma_stack.shape[2])
print newshape
#(57, 2958174)
y = stack.ma_stack.reshape(newshape)
p = np.ma.polyfit(x, y, deg=1)

But by column ~1500, every row in y is "cumulatively" masked and I get some complaints and empty output:

RankWarning: Polyfit may be poorly conditioned
** On entry to DLASCL, parameter number  4 had an illegal value
...

So it appears that using a 2D y with missing data at different locations won't work. I need a least-squares fit that uses all available unmasked data in each y column. There might be a way to do this by carefully compressing x and y and keeping track of the unmasked indices, as in the sketch below.
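A slow-but-correct fallback along those lines (using the stack attributes from above) would compress each column individually:

import numpy as np

x = stack.date_list_o.data
y = stack.ma_stack.reshape(stack.ma_stack.shape[0], -1)

#Fit each column with only its unmasked samples
p = np.ma.masked_all((2, y.shape[1]))
for j in np.arange(y.shape[1]):
    good = ~np.ma.getmaskarray(y[:, j])
    if good.sum() >= 2:    #need at least 2 points for a line
        A = np.c_[x[good], np.ones(good.sum())]
        p[:, j] = np.linalg.lstsq(A, y[:, j].compressed())[0]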

Any other ideas? pandas is starting to look like it might be a good solution.


Edited 12/3/13 @20:29

@HYRY offered a solution that works for missing values in the time dimension (axis=0). I had to modify it slightly to handle missing values in the spatial dimensions (axes 1 and 2). If a particular spatial index has only one unmasked entry in time, we certainly don't want to attempt a linear regression. Here is my implementation:

def linreg(self):
    #Only compute where we have at least n_min unmasked values in time
    n_min = 3
    valid_idx = self.ma_stack.count(axis=0).filled(0) >= n_min
    #Returns 2D array containing only the valid columns
    y = self.ma_stack[:, valid_idx]

    #Extract mask for axis 0 and invert - True where data are available
    mask = ~y.mask
    #Strip the mask; masked entries hold fill_value but are zeroed out below
    y = y.data
    #Independent variable is time ordinal
    x = self.date_list_o
    x = x.data

    #Prepare matrices and solve the masked normal equations
    X = np.c_[x, np.ones_like(x)]
    a = np.swapaxes(np.dot(X.T, (X[None, :, :] * mask.T[:, :, None])), 0, 1)
    b = np.dot(X.T, (mask*y))
    r = np.linalg.solve(a, b.T)

    #Create output grid with original spatial dimensions
    out = np.ma.masked_all_like(self.ma_stack[0])
    #Fill in the valid indices with the fitted slopes
    out[valid_idx] = r[:,0]
    return out

Runtime is very fast - only ~5-10 seconds for the array dimensions discussed here.

asked Dec 03 '13 by David Shean


1 Answer

If I understand correctly, you want to do many linear regressions y = k * x + b: there is only one x but many y, and for every y you want to calculate k and b.

If the shape of x is (50,) and the shape of y is (50, 1000), you can use numpy.linalg.lstsq. Here is a demo:

import numpy as np

x = np.random.rand(50)
k = np.random.rand(1000)
b = np.random.rand(1000)

#1000 columns of y, all sharing the same x
y = np.outer(x, k) + b + np.random.normal(size=(50, 1000), scale=1e-10)

#Solve for all 1000 (k, b) pairs in a single call
r = np.linalg.lstsq(np.c_[x, np.ones_like(x)], y)[0]

print np.allclose(r[0], k)
print np.allclose(r[1], b)

For a y with shape (50, 2000, 2000), you can reshape it to (50, 2000*2000), as in the sketch below.
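For example (small spatial dimensions here to keep the demo quick; the reshape at the end restores the spatial grid):

import numpy as np

x = np.random.rand(50)
y3 = np.random.rand(50, 20, 30)    #stand-in for a (50, 2000, 2000) stack

#Flatten the two spatial axes into one column axis
y2 = y3.reshape(y3.shape[0], -1)
r = np.linalg.lstsq(np.c_[x, np.ones_like(x)], y2)[0]

#Fold the fitted slopes and intercepts back onto the spatial grid
slope = r[0].reshape(y3.shape[1:])
intercept = r[1].reshape(y3.shape[1:])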

Edit

Here is my solution for masked arrays. The formula is:

(X^T * diag(m_i) * X) * beta_i = X^T * diag(m_i) * y_i, solved for each column i, where m_i is the boolean availability mask for column i.

Prepare Y as a 2D array with shape (1889*1566, 57) and X as a 2D array with shape (57, 2). mask is a bool array with the same shape as Y; True means the corresponding value in Y is available.

Calculate the array a with shape (1889*1566, 2, 2) and b with shape (1889*1566, 2), then call numpy.linalg.solve(a, b). Here is some demo code:

import numpy as np

N = 50      #number of time samples
M = 1000    #number of spatial locations

x = np.random.rand(N)
X = np.c_[x, np.ones_like(x)]
beta = np.random.rand(M, 2)
Y = np.dot(beta, X.T)
Y += np.random.normal(scale=0.1, size=Y.shape)
#Random availability mask - True where a value is present
mask = np.random.randint(0, 2, size=Y.shape).astype(np.bool)

#a[i] = X.T * diag(mask[i]) * X, stacked into shape (M, 2, 2)
a = np.swapaxes(np.dot(X.T, (X[None, :, :] * mask[:, :, None])), 0, 1)
#b[:, i] = X.T * (mask[i] * Y[i]), shape (2, M)
b = np.dot(X.T, (mask*Y).T)
beta2 = np.linalg.solve(a, b.T)

i = 123
print "real:", beta[i]
print "by solve:", beta2[i]

#Cross-check one location against a compressed lstsq fit
m = mask[i]
x2 = X[m]
y2 = Y[i, m]
print "by lstsq:", np.linalg.lstsq(x2, y2)[0]

Output for coefficient 123:

real: [ 0.35813131  0.29736779]
by solve: [ 0.38088499  0.30382547]
by lstsq: [ 0.38088499  0.30382547]

You can also calculate the a array with the following code; I think it will use less memory than the method above:

a2 = np.empty((M, 2, 2))
#x broadcast against the mask, zeroed where data are missing
xm = mask * x
a2[:, 0, 0] = (xm*xm).sum(1)
a2[:, 1, 0] = (xm*mask).sum(1)
a2[:, 0, 1] = a2[:, 1, 0]
a2[:, 1, 1] = (mask).sum(1)

print np.allclose(a2, a)
answered by HYRY