In Python I need to find the pairwise correlation between all features in a matrix A and all features in a matrix B. In particular, I am interested in finding the strongest Pearson correlation that a given feature in A has across all features in B. I do not care whether the strongest correlation is positive or negative.

I've done an inefficient implementation using two loops and scipy below. However, I'd like to use np.corrcoef or another similar method to compute it efficiently. Matrix A has shape 40000x400 and B has shape 40000x1440. My attempt at doing it efficiently can be seen below as the method find_max_absolute_corr(A, B). However, it fails with the following error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly
import numpy as np
from scipy.stats import pearsonr

def find_max_absolute_corr(A, B):
    """ Finds for each feature in `A` the highest Pearson
        correlation across all features in `B`. """
    max_corr_A = np.zeros((A.shape[1]))

    for A_col in range(A.shape[1]):
        print("Calculating {}/{}.".format(A_col + 1, A.shape[1]))
        metric = A[:, A_col]
        pearson = np.corrcoef(B, metric, rowvar=0)
        # takes negative correlations into account as well
        min_p = min(pearson)
        max_p = max(pearson)
        max_corr_A[A_col] = max_absolute(min_p, max_p)
    return max_corr_A

def max_absolute(min_p, max_p):
    if np.isnan(min_p) or np.isnan(max_p):
        raise ValueError("NaN correlation.")
    if abs(max_p) > abs(min_p):
        return max_p
    else:
        return min_p
if __name__ == '__main__':
    A = np.array(
        [[10, 8.04, 9.14, 7.46],
         [8, 6.95, 8.14, 6.77],
         [13, 7.58, 8.74, 12.74],
         [9, 8.81, 8.77, 7.11],
         [11, 8.33, 9.26, 7.81]])

    B = np.array(
        [[-14, -9.96, 8.10, 8.84, 8, 7.04],
         [-6, -7.24, 6.13, 6.08, 5, 5.25],
         [-4, -4.26, 3.10, 5.39, 8, 5.56],
         [-12, -10.84, 9.13, 8.15, 5, 7.91],
         [-7, -4.82, 7.26, 6.42, 8, 6.89]])

    # simple, inefficient method
    for A_col in range(A.shape[1]):
        high_corr = 0
        for B_col in range(B.shape[1]):
            corr, _ = pearsonr(A[:, A_col], B[:, B_col])
            high_corr = max_absolute(high_corr, corr)
        print(high_corr)
    # -0.161314601631
    # 0.956781516149
    # 0.621071009239
    # -0.421539304112

    # efficient method
    max_corr_A = find_max_absolute_corr(A, B)
    print(max_corr_A)
    # [-0.161314601631,
    #  0.956781516149,
    #  0.621071009239,
    #  -0.421539304112]
Seems scipy.stats.pearsonr follows this definition of the Pearson correlation coefficient, applied on column-wise pairs from A & B -

r = (N·Σxy − Σx·Σy) / sqrt( (N·Σx² − (Σx)²) · (N·Σy² − (Σy)²) )
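As a quick sanity check (a minimal sketch, assuming the sample A and B from the question are in scope), the summation form above matches scipy.stats.pearsonr for a single column pair:

import numpy as np
from scipy.stats import pearsonr

x, y = A[:, 0], B[:, 0]   # one column pair from the question's sample data
N = x.shape[0]

# numerator and denominator of the summation form above
num = N * (x * y).sum() - x.sum() * y.sum()
den = np.sqrt((N * (x**2).sum() - x.sum()**2) *
              (N * (y**2).sum() - y.sum()**2))

print(num / den)          # formula result
print(pearsonr(x, y)[0])  # matches scipy's value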
Based on that formula, this can be vectorized easily, as the pairwise computations of columns from A and B are independent of each other. Here's one vectorized solution using broadcasting -
# Get number of rows in either A or B
N = B.shape[0]
# Store column-wise sums of A and B, as they would be used in a few places
sA = A.sum(0)
sB = B.sum(0)
# Basically there are four parts in the formula. We would compute them one-by-one
p1 = N*np.einsum('ij,ik->kj',A,B)
p2 = sA*sB[:,None]
p3 = N*((B**2).sum(0)) - (sB**2)
p4 = N*((A**2).sum(0)) - (sA**2)
# Finally compute Pearson Correlation Coefficient as 2D array
pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))
# Get the element corresponding to absolute argmax along the columns
out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])]
Sample run -
1) Inputs :
In [12]: A
Out[12]:
array([[ 10. , 8.04, 9.14, 7.46],
[ 8. , 6.95, 8.14, 6.77],
[ 13. , 7.58, 8.74, 12.74],
[ 9. , 8.81, 8.77, 7.11],
[ 11. , 8.33, 9.26, 7.81]])
In [13]: B
Out[13]:
array([[-14. , -9.96, 8.1 , 8.84, 8. , 7.04],
[ -6. , -7.24, 6.13, 6.08, 5. , 5.25],
[ -4. , -4.26, 3.1 , 5.39, 8. , 5.56],
[-12. , -10.84, 9.13, 8.15, 5. , 7.91],
[ -7. , -4.82, 7.26, 6.42, 8. , 6.89]])
2) Original loopy code run -
In [14]: high_corr_out = np.zeros(A.shape[1])
...: for A_col in range(A.shape[1]):
...: high_corr = 0
...: for B_col in range(B.shape[1]):
...: corr,_ = pearsonr(A[:,A_col], B[:,B_col])
...: high_corr = max_absolute(high_corr, corr)
...: high_corr_out[A_col] = high_corr
...:
In [15]: high_corr_out
Out[15]: array([ 0.8067843 , 0.95678152, 0.74016181, -0.85127779])
3) Proposed code run -
In [16]: N = B.shape[0]
...: sA = A.sum(0)
...: sB = B.sum(0)
...: p1 = N*np.einsum('ij,ik->kj',A,B)
...: p2 = sA*sB[:,None]
...: p3 = N*((B**2).sum(0)) - (sB**2)
...: p4 = N*((A**2).sum(0)) - (sA**2)
...: pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))
...: out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])]
...:
In [17]: pcorr # Pearson Correlation Coefficient array
Out[17]:
array([[ 0.41895565, -0.5910935 , -0.40465987, 0.5818286 ],
[ 0.66609445, -0.41950457, 0.02450215, 0.64028344],
[-0.64953314, 0.65669916, 0.30836196, -0.85127779],
[-0.41917583, 0.59043266, 0.40364532, -0.58144102],
[ 0.8067843 , 0.07947386, 0.74016181, 0.53165395],
[-0.1613146 , 0.95678152, 0.62107101, -0.4215393 ]])
In [18]: out # elements corresponding to absolute argmax along columns
Out[18]: array([ 0.8067843 , 0.95678152, 0.74016181, -0.85127779])
Runtime tests -
In [36]: A = np.random.rand(4000,40)
In [37]: B = np.random.rand(4000,144)
In [38]: np.allclose(org_app(A,B),proposed_app(A,B))
Out[38]: True
In [39]: %timeit org_app(A,B) # Original approach
1 loops, best of 3: 1.35 s per loop
In [40]: %timeit proposed_app(A,B) # Proposed vectorized approach
10 loops, best of 3: 39.1 ms per loop
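For reference, org_app and proposed_app used in the timings above are not defined in the post; they could look roughly like this (a sketch that simply wraps the question's double loop and this answer's vectorized code, reusing the question's max_absolute helper):

import numpy as np
from scipy.stats import pearsonr

def org_app(A, B):
    # question's double loop over all column pairs
    out = np.zeros(A.shape[1])
    for A_col in range(A.shape[1]):
        high_corr = 0
        for B_col in range(B.shape[1]):
            corr, _ = pearsonr(A[:, A_col], B[:, B_col])
            high_corr = max_absolute(high_corr, corr)
        out[A_col] = high_corr
    return out

def proposed_app(A, B):
    # vectorized Pearson correlation of all column pairs at once
    N = B.shape[0]
    sA = A.sum(0)
    sB = B.sum(0)
    p1 = N * np.einsum('ij,ik->kj', A, B)
    p2 = sA * sB[:, None]
    p3 = N * ((B**2).sum(0)) - (sB**2)
    p4 = N * ((A**2).sum(0)) - (sA**2)
    pcorr = (p1 - p2) / np.sqrt(p4 * p3[:, None])
    return pcorr[np.nanargmax(np.abs(pcorr), axis=0),
                 np.arange(pcorr.shape[1])]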
Adding on to the above answer from personal experience:

p1 = N*np.dot(B.T,A)

worked much faster for me compared to

p1 = N*np.einsum('ij,ik->kj',A,B)

This was especially true when A and B are large matrices.
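A quick check (a minimal sketch with random data) confirms the two expressions are interchangeable; np.dot typically wins because it dispatches to an optimized BLAS matrix multiply:

import numpy as np

A = np.random.rand(4000, 40)
B = np.random.rand(4000, 144)
N = B.shape[0]

p1_einsum = N * np.einsum('ij,ik->kj', A, B)  # shape (B_cols, A_cols)
p1_dot = N * np.dot(B.T, A)                   # same shape, same values

print(np.allclose(p1_einsum, p1_dot))         # True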