I have a 1D array of independent variable values (x_array
) that match the timesteps in a 3D numpy array of spatial data with multiple time-steps (y_array
). My actual data is much larger: 300+ timesteps and up to 3000 * 3000 pixels:
import numpy as np
from scipy.stats import linregress
# Independent variable: four time-steps of 1-dimensional data
x_array = np.array([0.5, 0.2, 0.4, 0.4])
# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2, -0.2, -0.3],
[-0.3, -0.2, -0.3],
[-0.3, -0.4, -0.4]],
[[-0.2, -0.2, -0.4],
[-0.3, np.nan, -0.3],
[-0.3, -0.3, -0.4]],
[[np.nan, np.nan, -0.3],
[-0.2, -0.3, -0.7],
[-0.3, -0.3, -0.3]],
[[-0.1, -0.3, np.nan],
[-0.2, -0.3, np.nan],
[-0.1, np.nan, np.nan]]])
I want to compute a per-pixel linear regression and obtain R-squared, P-values, intercepts and slopes for each xy
pixel in y_array
, with values for each timestep in x_array
as my independent variable.
I can reshape to get the data in a form to input it into np.polyfit
which is vectorised and fast:
# Reshape so rows = number of time-steps and columns = pixels:
y_array_reshaped = y_array.reshape(len(y_array), -1)
# Do a first-degree polyfit
np.polyfit(x_array, y_array_reshaped, 1)
However, this ignores pixels that contain any NaN
values (np.polyfit
does not support NaN
values), and does not calculate the statistics I require (R-squared, P-values, intercepts and slopes).
The answer here uses scipy.stats import linregress
which does calculate the statistics I need, and suggests avoiding NaN
issues by masking out these NaN
values. However, this example is for two 1D arrays, and I can't work out how to apply a similar masking approach to my case where each column in y_array_reshaped
will have a different set of NaN
values.
My question: How can I calculate regression statistics for each pixel in a large multidimensional array (300 x 3000 x 3000) containing many NaN
values in a reasonably fast, vectorised way?
Desired result: A 3 x 3 array of regression statistic values (e.g. R-squared) for each pixel in y_array
, even if that pixel contains NaN
values at some point in the time series
This blog post mentioned in the comments above contains an incredibly fast vectorized function for cross-correlation, covariance, and regression for multi-dimensional data in Python. It produces all of the regression outputs I need, and does so in milliseconds as it relies entirely on simple vectorised array operations in xarray
.
https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html
I have made one minor change (first line after #3
) to ensure the function correctly accounts for different numbers of NaN values in each pixel:
def lag_linregress_3D(x, y, lagx=0, lagy=0):
"""
Input: Two xr.Datarrays of any dimensions with the first dim being time.
Thus the input data could be a 1D time series, or for example, have three
dimensions (time,lat,lon).
Datasets can be provided in any order, but note that the regression slope
and intercept will be calculated for y with respect to x.
Output: Covariance, correlation, regression slope and intercept, p-value,
and standard error on regression between the two datasets along their
aligned time dimension.
Lag values can be assigned to either of the data, with lagx shifting x, and
lagy shifting y, with the specified lag amount.
"""
#1. Ensure that the data are properly alinged to each other.
x,y = xr.align(x,y)
#2. Add lag information if any, and shift the data accordingly
if lagx!=0:
# If x lags y by 1, x must be shifted 1 step backwards.
# But as the 'zero-th' value is nonexistant, xr assigns it as invalid
# (nan). Hence it needs to be dropped
x = x.shift(time = -lagx).dropna(dim='time')
# Next important step is to re-align the two datasets so that y adjusts
# to the changed coordinates of x
x,y = xr.align(x,y)
if lagy!=0:
y = y.shift(time = -lagy).dropna(dim='time')
x,y = xr.align(x,y)
#3. Compute data length, mean and standard deviation along time axis:
n = y.notnull().sum(dim='time')
xmean = x.mean(axis=0)
ymean = y.mean(axis=0)
xstd = x.std(axis=0)
ystd = y.std(axis=0)
#4. Compute covariance along time axis
cov = np.sum((x - xmean)*(y - ymean), axis=0)/(n)
#5. Compute correlation along time axis
cor = cov/(xstd*ystd)
#6. Compute regression slope and intercept:
slope = cov/(xstd**2)
intercept = ymean - xmean*slope
#7. Compute P-value and standard error
#Compute t-statistics
tstats = cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
stderr = slope/tstats
from scipy.stats import t
pval = t.sf(tstats, n-2)*2
pval = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)
return cov,cor,slope,intercept,pval,stderr
I'm not sure how this would scale up (perhaps you could use dask), but here is a pretty straightforward way to do this with a pandas DataFrame
using the apply method:
import pandas as pd
import numpy as np
from scipy.stats import linregress
# Independent variable: four time-steps of 1-dimensional data
x_array = np.array([0.5, 0.2, 0.4, 0.4])
# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2, -0.2, -0.3],
[-0.3, -0.2, -0.3],
[-0.3, -0.4, -0.4]],
[[-0.2, -0.2, -0.4],
[-0.3, np.nan, -0.3],
[-0.3, -0.3, -0.4]],
[[np.nan, np.nan, -0.3],
[-0.2, -0.3, -0.7],
[-0.3, -0.3, -0.3]],
[[-0.1, -0.3, np.nan],
[-0.2, -0.3, np.nan],
[-0.1, np.nan, np.nan]]])
def lin_regress(col):
"Mask nulls and apply stats.linregress"
col = col.loc[~pd.isnull(col)]
return linregress(col.index.tolist(), col)
# Build the DataFrame (each index represents a pixel)
df = pd.DataFrame(y_array.reshape(len(y_array), -1), index=x_array.tolist())
# Apply a our custom linregress wrapper to each function, split the tuple into separate columns
final_df = df.apply(lin_regress).apply(pd.Series)
# Name the index and columns to make this easier to read
final_df.columns, final_df.index.name = 'slope, intercept, r_value, p_value, std_err'.split(', '), 'pixel_number'
print(final_df)
Output:
slope intercept r_value p_value std_err
pixel_number
0 0.071429 -0.192857 0.188982 8.789623e-01 0.371154
1 -0.071429 -0.207143 -0.188982 8.789623e-01 0.371154
2 0.357143 -0.464286 0.944911 2.122956e-01 0.123718
3 0.105263 -0.289474 0.229416 7.705843e-01 0.315789
4 1.000000 -0.700000 1.000000 9.003163e-11 0.000000
5 -0.285714 -0.328571 -0.188982 8.789623e-01 1.484615
6 0.105263 -0.289474 0.132453 8.675468e-01 0.557000
7 -0.285714 -0.228571 -0.755929 4.543711e-01 0.247436
8 0.071429 -0.392857 0.188982 8.789623e-01 0.371154
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With