I'd like to calculate the Pearson correlation coefficient between a vector and each row of an array in Python (NumPy and/or SciPy are assumed). The standard correlation-matrix functions are not an option because of the size of the real data arrays and memory constraints. Here's my naive implementation:
import numpy as np
import scipy.stats as sps

np.random.seed(0)

def correlateOneWithMany(one, many):
    """Return Pearson's correlation coef of 'one' with each row of 'many'."""
    pr_arr = np.zeros((many.shape[0], 2), dtype=np.float64)
    pr_arr[:] = np.nan
    for row_num in np.arange(many.shape[0]):
        pr_arr[row_num, :] = sps.pearsonr(one, many[row_num, :])
    return pr_arr

obs, varz = 10 ** 3, 500
X = np.random.uniform(size=(obs, varz))
pr = correlateOneWithMany(X[0, :], X)

%timeit correlateOneWithMany(X[0, :], X)
# 10 loops, best of 3: 38.9 ms per loop
Any thoughts on accelerating this would be greatly appreciated!
The Pearson correlation coefficient can be computed in Python with NumPy's corrcoef() function. Its input is typically a 2-D array. By default (rowvar=True), each row holds the values of one variable and each column is a single observation of all variables; pass rowvar=False if your variables are stored in columns instead.
To calculate the correlation between just two variables, pass them to corrcoef() as two 1-D arrays; the result is a 2x2 matrix whose off-diagonal entry is the coefficient.
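As a minimal sketch of the corrcoef() usage described above (the synthetic data and the variable names are made up for illustration), the following computes the coefficient for two 1-D arrays and then for the same data stored column-wise:

```python
import numpy as np

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(size=100)
y = 2.0 * x + rng.normal(scale=0.1, size=100)

# Two 1-D arrays: the result is a 2x2 correlation matrix.
r_matrix = np.corrcoef(x, y)
r_xy = r_matrix[0, 1]  # off-diagonal entry is corr(x, y)

# Same data laid out as (samples, variables): tell NumPy that
# the variables are in columns with rowvar=False.
data = np.column_stack([x, y])  # shape (100, 2)
r_cols = np.corrcoef(data, rowvar=False)[0, 1]

print(r_xy, r_cols)
```

Both calls should give the same value, since only the layout of the input differs.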
You can plot the relationship between two columns of a pandas DataFrame with seaborn: sns.regplot(x=df['column_1'], y=df['column_2']). This draws a scatterplot of the two columns with a fitted regression line.
The SciPy function pearsonr() calculates the Pearson correlation coefficient between two data samples of the same length, and also returns a p-value for the test of no correlation.
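A short sketch of a single pearsonr() call, using made-up sample data; the coefficient it returns should agree with the off-diagonal entry from np.corrcoef():

```python
import numpy as np
from scipy import stats

# Synthetic samples, assumed purely for illustration.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(size=200)

# pearsonr returns the coefficient and a two-sided p-value.
r, p_value = stats.pearsonr(a, b)

# Cross-check against NumPy's correlation matrix.
r_numpy = np.corrcoef(a, b)[0, 1]

print(r, p_value, r_numpy)
```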
The module scipy.spatial.distance implements the "correlation distance", which is simply one minus the correlation coefficient. You can use the function cdist to compute the one-to-many distances, and get the correlation coefficients by subtracting the result from 1.
Here's a modified version of your script that includes the calculation of the correlation coefficients using cdist:
import numpy as np
import scipy.stats as sps
from scipy.spatial.distance import cdist

np.random.seed(0)

def correlateOneWithMany(one, many):
    """Return Pearson's correlation coef of 'one' with each row of 'many'."""
    pr_arr = np.zeros((many.shape[0], 2), dtype=np.float64)
    pr_arr[:] = np.nan
    for row_num in np.arange(many.shape[0]):
        pr_arr[row_num, :] = sps.pearsonr(one, many[row_num, :])
    return pr_arr

obs, varz = 10 ** 3, 500
X = np.random.uniform(size=(obs, varz))

pr = correlateOneWithMany(X[0, :], X)

c = 1 - cdist(X[0:1, :], X, metric='correlation')[0]

print(np.allclose(c, pr[:, 0]))
Timing:
In [133]: %timeit correlateOneWithMany(X[0, :], X)
10 loops, best of 3: 37.7 ms per loop
In [134]: %timeit 1 - cdist(X[0:1, :], X, metric='correlation')[0]
1000 loops, best of 3: 1.11 ms per loop
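Note that cdist only gives the coefficients, while the original function also returned p-values. If those are needed, one possible approach (a sketch, not taken from the answer above) is to vectorize the Pearson formula directly in NumPy and derive the two-sided p-value from the t distribution with n - 2 degrees of freedom. The helper name pearson_one_vs_rows is hypothetical, and the sketch assumes no row is constant (a constant row gives a zero denominator and a nan result).

```python
import numpy as np
from scipy import stats

def pearson_one_vs_rows(one, many):
    """Return (r, p) of 'one' against each row of 'many' (hypothetical helper)."""
    one = np.asarray(one, dtype=np.float64)
    many = np.asarray(many, dtype=np.float64)
    n = one.shape[0]

    # Center the vector and every row (this makes one centered copy of 'many').
    one_c = one - one.mean()
    many_c = many - many.mean(axis=1, keepdims=True)

    # r = sum(x*y) / sqrt(sum(x^2) * sum(y^2)) on the centered data.
    num = many_c @ one_c
    den = np.sqrt((one_c @ one_c) * np.einsum("ij,ij->i", many_c, many_c))
    r = np.clip(num / den, -1.0, 1.0)  # guard against rounding past +/-1

    # Two-sided p-value via t = r * sqrt((n - 2) / (1 - r^2)).
    with np.errstate(divide="ignore"):
        t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    p = 2.0 * stats.t.sf(np.abs(t), df=n - 2)
    return r, p

# Usage on random data shaped like the example above (smaller for brevity).
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 50))
r, p = pearson_one_vs_rows(X[0], X)
print(r[:3], p[:3])
```

Because the centered copy has the same size as the input, memory-constrained data could be passed through this function in blocks of rows, with the results concatenated afterwards.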