Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to apply linear regression to every pixel in a large multi-dimensional array containing NaNs?

I have a 1D array of independent variable values (x_array) that match the timesteps in a 3D numpy array of spatial data with multiple time-steps (y_array). My actual data is much larger: 300+ timesteps and up to 3000 * 3000 pixels:

import numpy as np
from scipy.stats import linregress

# Independent variable: four time-steps of 1-dimensional data 
x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],

                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],

                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],

                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])

I want to compute a per-pixel linear regression and obtain R-squared, P-values, intercepts and slopes for each xy pixel in y_array, with values for each timestep in x_array as my independent variable.

I can reshape to get the data in a form to input it into np.polyfit which is vectorised and fast:

# Reshape so rows = number of time-steps and columns = pixels:
y_array_reshaped = y_array.reshape(len(y_array), -1)

# Do a first-degree polyfit
np.polyfit(x_array, y_array_reshaped, 1)

However, this ignores pixels that contain any NaN values (np.polyfit does not support NaN values), and does not calculate the statistics I require (R-squared, P-values, intercepts and slopes).

The answer here uses scipy.stats import linregress which does calculate the statistics I need, and suggests avoiding NaN issues by masking out these NaN values. However, this example is for two 1D arrays, and I can't work out how to apply a similar masking approach to my case where each column in y_array_reshaped will have a different set of NaN values.

My question: How can I calculate regression statistics for each pixel in a large multidimensional array (300 x 3000 x 3000) containing many NaN values in a reasonably fast, vectorised way?

Desired result: A 3 x 3 array of regression statistic values (e.g. R-squared) for each pixel in y_array, even if that pixel contains NaN values at some point in the time series

like image 796
Robbi Bishop-Taylor Avatar asked Aug 31 '18 04:08

Robbi Bishop-Taylor


2 Answers

This blog post mentioned in the comments above contains an incredibly fast vectorized function for cross-correlation, covariance, and regression for multi-dimensional data in Python. It produces all of the regression outputs I need, and does so in milliseconds as it relies entirely on simple vectorised array operations in xarray.

https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html

I have made one minor change (first line after #3) to ensure the function correctly accounts for different numbers of NaN values in each pixel:

def lag_linregress_3D(x, y, lagx=0, lagy=0):
"""
Input: Two xr.Datarrays of any dimensions with the first dim being time. 
Thus the input data could be a 1D time series, or for example, have three 
dimensions (time,lat,lon). 
Datasets can be provided in any order, but note that the regression slope 
and intercept will be calculated for y with respect to x.
Output: Covariance, correlation, regression slope and intercept, p-value, 
and standard error on regression between the two datasets along their 
aligned time dimension.  
Lag values can be assigned to either of the data, with lagx shifting x, and
lagy shifting y, with the specified lag amount. 
""" 
#1. Ensure that the data are properly alinged to each other. 
x,y = xr.align(x,y)

#2. Add lag information if any, and shift the data accordingly
if lagx!=0:

    # If x lags y by 1, x must be shifted 1 step backwards. 
    # But as the 'zero-th' value is nonexistant, xr assigns it as invalid 
    # (nan). Hence it needs to be dropped
    x   = x.shift(time = -lagx).dropna(dim='time')

    # Next important step is to re-align the two datasets so that y adjusts
    # to the changed coordinates of x
    x,y = xr.align(x,y)

if lagy!=0:
    y   = y.shift(time = -lagy).dropna(dim='time')
    x,y = xr.align(x,y)

#3. Compute data length, mean and standard deviation along time axis: 
n = y.notnull().sum(dim='time')
xmean = x.mean(axis=0)
ymean = y.mean(axis=0)
xstd  = x.std(axis=0)
ystd  = y.std(axis=0)

#4. Compute covariance along time axis
cov   =  np.sum((x - xmean)*(y - ymean), axis=0)/(n)

#5. Compute correlation along time axis
cor   = cov/(xstd*ystd)

#6. Compute regression slope and intercept:
slope     = cov/(xstd**2)
intercept = ymean - xmean*slope  

#7. Compute P-value and standard error
#Compute t-statistics
tstats = cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
stderr = slope/tstats

from scipy.stats import t
pval   = t.sf(tstats, n-2)*2
pval   = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)

return cov,cor,slope,intercept,pval,stderr
like image 159
Robbi Bishop-Taylor Avatar answered Oct 18 '22 14:10

Robbi Bishop-Taylor


I'm not sure how this would scale up (perhaps you could use dask), but here is a pretty straightforward way to do this with a pandas DataFrame using the apply method:

import pandas as pd
import numpy as np
from scipy.stats import linregress

# Independent variable: four time-steps of 1-dimensional data 
x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],

                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],

                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],

                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])

def lin_regress(col):
    "Mask nulls and apply stats.linregress"
    col = col.loc[~pd.isnull(col)]
    return linregress(col.index.tolist(), col)

# Build the DataFrame (each index represents a pixel)
df = pd.DataFrame(y_array.reshape(len(y_array), -1), index=x_array.tolist())

# Apply a our custom linregress wrapper to each function, split the tuple into separate columns
final_df = df.apply(lin_regress).apply(pd.Series)

# Name the index and columns to make this easier to read
final_df.columns, final_df.index.name = 'slope, intercept, r_value, p_value, std_err'.split(', '), 'pixel_number'

print(final_df)

Output:

                 slope  intercept   r_value       p_value   std_err
pixel_number                                                       
0             0.071429  -0.192857  0.188982  8.789623e-01  0.371154
1            -0.071429  -0.207143 -0.188982  8.789623e-01  0.371154
2             0.357143  -0.464286  0.944911  2.122956e-01  0.123718
3             0.105263  -0.289474  0.229416  7.705843e-01  0.315789
4             1.000000  -0.700000  1.000000  9.003163e-11  0.000000
5            -0.285714  -0.328571 -0.188982  8.789623e-01  1.484615
6             0.105263  -0.289474  0.132453  8.675468e-01  0.557000
7            -0.285714  -0.228571 -0.755929  4.543711e-01  0.247436
8             0.071429  -0.392857  0.188982  8.789623e-01  0.371154
like image 37
cosmic_inquiry Avatar answered Oct 18 '22 14:10

cosmic_inquiry