I have 3D stacks of masked arrays. I'd like to perform a linear regression for the values at each row, col (spatial index) along axis 0 (time). The dimensions of these stacks vary, but a typical shape might be (50, 2000, 2000). My spatially limited but temporally dense test case has these dimensions:
stack.ma_stack.shape
(1461, 390, 327)
I did a quick test, looping over each row, col:
import numpy as np
from scipy.stats.mstats import linregress

#Ordinal dates
x = stack.date_list_o

#Note: idx should be (row, col)
def sample_lstsq(idx):
    b = stack.ma_stack[:, idx[0], idx[1]]
    #Note: this is the masked-stats version of linregress
    slope, intercept, r_value, p_value, std_err = linregress(x, b)
    return slope

out = np.zeros_like(stack.ma_stack[0])
for row in np.arange(stack.ma_stack.shape[1]):
    for col in np.arange(stack.ma_stack.shape[2]):
        out[row, col] = sample_lstsq((row, col))
This works (slowly). I know there must be a more efficient approach.
I started playing with index arrays and np.vectorize, but I don't think that's actually going to offer any real improvement. I've considered dumping everything to Pandas or attempting to port to Cython, but I'm hoping I can stick with NumPy/SciPy. Or maybe a parallel solution is my best option for improving performance?
Also, has anybody benchmarked the various NumPy/SciPy linear regression options? I've come across a few candidates but haven't tested them myself.
I'm hoping there is an existing approach that offers a significant performance boost without much implementation work. Thanks.
Edited 12/3/13 @02:29
The approach suggested by @HYRY works perfectly (~15 s runtime) for the sample dataset described above, which is continuous (unmasked) in all dimensions (space and time). However, when a masked array containing missing data is passed to np.linalg.lstsq, all masked values are filled with the fill_value (default 1e20), which leads to bogus linear fits.
Fortunately, the numpy masked array module has np.ma.polyfit(deg=1), which can handle a 2D y array like np.linalg.lstsq. Looking at the source (https://github.com/numpy/numpy/blob/v1.8.0/numpy/ma/extras.py#L1852), the ma polyfit is just a wrapper for np.polyfit that uses a combined mask from the x and y masks. This works nicely for a 2D y when the locations of the missing data in y are constant across columns.
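To illustrate that constraint, here is a minimal sketch with toy data of my own (not the stack above): with the same time steps masked in every column, np.ma.polyfit recovers the per-column slopes.
import numpy as np

x = np.arange(10.0)
slopes = np.array([1.0, 2.0, 3.0, 4.0])
#4 "pixels" observed at 10 times: y = slope*x + 5
y = np.ma.masked_array(np.outer(x, slopes) + 5.0)
#Mask the same two time steps in every column
y[[2, 7], :] = np.ma.masked
#p has shape (2, 4): p[0] holds the slopes, p[1] the intercepts
p = np.ma.polyfit(x, y, deg=1)
print np.allclose(p[0], slopes)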
Unfortunately, my data have variable missing data locations in space and time. Here's an example from another stack:
In [146]: stack.ma_stack.shape
Out[146]: (57, 1889, 1566)
Sampling a single index returns a timeseries with 6 unmasked values:
In [147]: stack.ma_stack[:,0,0]
Out[147]:
masked_array(data = [-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
519.7779541015625 -- -- -- 518.9047241210938 -- -- -- -- -- -- --
516.6539306640625 516.0836181640625 515.9403686523438 -- -- -- --
514.85205078125 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --],
mask = [ True True True True True True True True True True True True
True True True True True True True True True False True True
True False True True True True True True True False False False
True True True True False True True True True True True True
True True True True True True True True True],
fill_value = 1e+20)
Sampling a different location returns a different number of unmasked values from different time slices:
In [148]: stack.ma_stack[:,1888,1565]
Out[148]:
masked_array(data = [-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
729.0936889648438 -- -- -- 724.7155151367188 -- -- -- -- -- -- --
722.076171875 720.9276733398438 721.9603881835938 -- 720.3294067382812 --
-- 713.9591064453125 709.8037719726562 707.756103515625 -- -- --
703.662353515625 -- -- -- -- 708.6276245117188 -- -- -- -- --],
mask = [ True True True True True True True True True True True True
True True True True True True True True True False True True
True False True True True True True True True False False False
True False True True False False False True True True False True
True True True False True True True True True],
fill_value = 1e+20)
The minimum number of unmasked values across all indices is 6 and the maximum is 45, so every location has at least some masked values.
For reference, my x (time ordinals) values are all unmasked:
In [150]: stack.date_list_o
Out[150]:
masked_array(data = [ 733197.64375 733962.64861111 733964.65694444 733996.62361111
733999.64236111 734001.63541667 734033.64305556 734071.64722222
734214.675 734215.65694444 734216.625 734226.64722222
734229.63819444 734232.65694444 734233.67847222 734238.63055556
734238.63055556 734245.65277778 734245.65277778 734255.63125
734255.63125 734307.85 734326.65138889 734348.63888889
734348.63958333 734351.85 734363.70763889 734364.65486111
734390.64722222 734391.63194444 734394.65138889 734407.64652778
734407.64722222 734494.85 734527.85 734582.85
734602.65486111 734664.85555556 734692.64027778 734741.63541667
734747.85 734807.85555556 734884.85555556 734911.65763889
734913.64375 734917.64236111 734928.85555556 734944.71388889
734961.62777778 735016.04583333 735016.62777778 735016.85555556
735036.65347222 735054.04583333 735102.63125 735119.61180556
735140.63263889],
mask = False,
fill_value = 1e+20)
So I reshape stack.ma_stack and run polyfit:
newshape = (stack.ma_stack.shape[0], stack.ma_stack.shape[1]*stack.ma_stack.shape[2])
print newshape
#(57, 2958174)
y = stack.ma_stack.reshape(newshape)
p = np.ma.polyfit(x, y, deg=1)
But by column ~1500, every row in y is "cumulatively" masked and I get some complaints and empty output:
RankWarning: Polyfit may be poorly conditioned
** On entry to DLASCL, parameter number 4 had an illegal value
...
So, it appears that using a 2D y with missing data at different locations won't work with np.ma.polyfit. I need a least-squares fit that uses all available unmasked data in each y column. There might be a way to do this by carefully compressing x and y and keeping track of unmasked indices, as in the sketch below.
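For the record, a minimal per-column sketch of that bookkeeping (fit_column is my own hypothetical helper, and x is assumed fully unmasked). It still loops over columns, so it only illustrates the idea rather than fixing the speed problem:
import numpy as np

def fit_column(x, y_col, n_min=3):
    #y_col is a 1D masked array; keep only the unmasked samples
    good = ~np.ma.getmaskarray(y_col)
    if good.sum() < n_min:
        return np.nan
    #Compress x and y to the same unmasked indices, then fit
    slope, intercept = np.polyfit(np.asarray(x)[good], y_col.compressed(), 1)
    return slope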
Any other ideas? pandas is starting to look like it might be a good solution.
Edited 12/3/13 @20:29
@HYRY offered a solution that works for missing values in the time dimension (axis=0). I had to modify it slightly to handle missing values in the spatial dimensions (axes 1 and 2). If a particular spatial index has only one unmasked entry in time, we certainly don't want to attempt a linear regression. Here is my implementation:
def linreg(self):
    #Only compute where we have at least n_min unmasked values in time
    n_min = 3
    valid_idx = self.ma_stack.count(axis=0).filled(0) >= n_min
    #Returns a 2D array containing only the valid columns
    y = self.ma_stack[:, valid_idx]
    #Extract the mask along axis 0 and invert: True where data are available
    mask = ~y.mask
    #Remove the mask; masked slots hold fill_value but are zeroed out by mask below
    y = y.data
    #Independent variable is the time ordinal
    x = self.date_list_o
    x = x.data
    #Prepare the design matrix and solve the per-column normal equations
    X = np.c_[x, np.ones_like(x)]
    a = np.swapaxes(np.dot(X.T, (X[None, :, :] * mask.T[:, :, None])), 0, 1)
    b = np.dot(X.T, (mask*y))
    r = np.linalg.solve(a, b.T)
    #Create an output grid with the original spatial dimensions
    out = np.ma.masked_all_like(self.ma_stack[0])
    #Fill in the valid indices with the fitted slopes
    out[valid_idx] = r[:, 0]
    return out
Runtime is very fast - only ~5-10 seconds for the array dimensions discussed here.
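For reference, a hypothetical call, assuming linreg is defined as a method on the same stack object used throughout (names follow the snippets above and are not confirmed):
#Masked 2D grid of slopes; pixels with fewer than n_min samples stay masked
trend = stack.linreg()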
If I understand correctly, you want to do many linear regressions y = k*x + b: there is only one x, but many y, and for every y you want to calculate the k and b.
If the shape of x is (50,) and y is (50, 1000), you can use numpy.linalg.lstsq; here is a demo:
import numpy as np

x = np.random.rand(50)
#One slope k and intercept b per column of y
k = np.random.rand(1000)
b = np.random.rand(1000)
#1000 nearly noise-free lines sharing the same x
y = np.outer(x, k) + b + np.random.normal(size=(50, 1000), scale=1e-10)
#Solve all 1000 regressions at once against the design matrix [x, 1]
r = np.linalg.lstsq(np.c_[x, np.ones_like(x)], y)[0]
print np.allclose(r[0], k)
print np.allclose(r[1], b)
For a y with shape (50, 2000, 2000), you can reshape it to (50, 2000*2000) and reshape the result back, as in the sketch below.
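A minimal sketch of that reshape round trip, with toy sizes of my own choosing:
import numpy as np

x = np.random.rand(50)
#Stand-in for a (time, row, col) stack
y3d = np.random.rand(50, 20, 20)
#Flatten the two spatial axes into one column axis
y2d = y3d.reshape(50, -1)
coef = np.linalg.lstsq(np.c_[x, np.ones_like(x)], y2d)[0]
#Restore the spatial grid: one slope per row, col
slope_grid = coef[0].reshape(20, 20)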
Edit
Here is my solution for masked arrays. The formula is the per-column normal equation: for each pixel i (each row of Y), solve (X.T * M_i * X) * beta_i = X.T * M_i * Y_i, where M_i = diag(mask_i) zeroes out the masked samples of that row.
Prepare Y as a 2-dim array with shape (1889*1566, 57) and X as a 2-dim array with shape (57, 2). mask is a bool array with the same shape as Y; True means the value in Y is available.
Calculate an array a with shape (1889*1566, 2, 2) and an array b with shape (1889*1566, 2), then call numpy.linalg.solve(a, b). Here is some demo code:
import numpy as np

N = 50
M = 1000
x = np.random.rand(N)
#Design matrix with a slope column and an intercept column
X = np.c_[x, np.ones_like(x)]
#One (k, b) pair per row of Y
beta = np.random.rand(M, 2)
Y = np.dot(beta, X.T)
Y += np.random.normal(scale=0.1, size=Y.shape)
#Random availability: True where a value in Y is usable
mask = np.random.randint(0, 2, size=Y.shape).astype(bool)
#a[i] = X.T * diag(mask[i]) * X, one 2x2 system per row of Y
a = np.swapaxes(np.dot(X.T, (X[None, :, :] * mask[:, :, None])), 0, 1)
b = np.dot(X.T, (mask*Y).T)
#Solve all M 2x2 systems at once
beta2 = np.linalg.solve(a, b.T)
#Check one row against a direct lstsq fit on its unmasked samples
i = 123
print "real:", beta[i]
print "by solve:", beta2[i]
m = mask[i]
x2 = X[m]
y2 = Y[i, m]
print "by lstsq:", np.linalg.lstsq(x2, y2)[0]
Output for the 123rd coefficient:
real: [ 0.35813131 0.29736779]
by solve: [ 0.38088499 0.30382547]
by lstsq: [ 0.38088499 0.30382547]
You can also calculate the a array with the following code; I think it will use less memory than the method above:
#Fill the (M, 2, 2) normal-equation matrices term by term
a2 = np.empty((M, 2, 2))
xm = mask * x
a2[:, 0, 0] = (xm*xm).sum(1)
a2[:, 1, 0] = (xm*mask).sum(1)
a2[:, 0, 1] = a2[:, 1, 0]
a2[:, 1, 1] = mask.sum(1)
print np.allclose(a2, a)