
Accelerating one-to-many correlation calculations in Python

I'd like to calculate Pearson's correlation coefficient between a vector and each row of an array in Python (NumPy and/or SciPy are assumed). Standard correlation-matrix functions are not an option because of the size of the real data arrays and memory constraints. Here's my naive implementation:

import numpy as np
import scipy.stats as sps

np.random.seed(0)

def correlateOneWithMany(one, many):
    """Return Pearson's correlation coef of 'one' with each row of 'many'."""
    pr_arr = np.zeros((many.shape[0], 2), dtype=np.float64)
    pr_arr[:] = np.nan
    for row_num in np.arange(many.shape[0]):
        pr_arr[row_num, :] = sps.pearsonr(one, many[row_num, :])
    return pr_arr

obs, varz = 10 ** 3, 500
X = np.random.uniform(size=(obs, varz))

pr = correlateOneWithMany(X[0, :], X)

%timeit correlateOneWithMany(X[0, :], X)
# 10 loops, best of 3: 38.9 ms per loop

Any thoughts on accelerating this would be greatly appreciated!

dewarrn1 asked Jun 01 '16 21:06



1 Answer

The module scipy.spatial.distance implements the "correlation distance", which is simply one minus the correlation coefficient. You can use the function cdist to compute the one-to-many distances, and get the correlation coefficients by subtracting the result from 1.

Here's a modified version of your script that includes the calculation of the correlation coefficients using cdist:

import numpy as np
import scipy.stats as sps
from scipy.spatial.distance import cdist

np.random.seed(0)

def correlateOneWithMany(one, many):
    """Return Pearson's correlation coef of 'one' with each row of 'many'."""
    pr_arr = np.zeros((many.shape[0], 2), dtype=np.float64)
    pr_arr[:] = np.nan
    for row_num in np.arange(many.shape[0]):
        pr_arr[row_num, :] = sps.pearsonr(one, many[row_num, :])
    return pr_arr

obs, varz = 10 ** 3, 500
X = np.random.uniform(size=(obs, varz))

pr = correlateOneWithMany(X[0, :], X)

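# cdist expects 2-D inputs, so X[0:1, :] keeps the vector as a 1 x varz array.
# The result has shape (1, obs); [0] takes that single row of distances.
c = 1 - cdist(X[0:1, :], X, metric='correlation')[0]

# Compare with the r values (column 0) from the loop version.
print(np.allclose(c, pr[:, 0]))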
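
To see what the 'correlation' metric is computing, here is a minimal NumPy-only sketch of the same one-to-many calculation: center the vector and each row on its own mean, then divide the dot products by the product of the norms. The helper name pearson_one_vs_rows is made up for illustration and is not part of NumPy or SciPy.

```python
import numpy as np
from scipy.spatial.distance import cdist

def pearson_one_vs_rows(one, many):
    """Pearson r of 1-D `one` with each row of 2-D `many` (illustrative helper)."""
    one = np.asarray(one, dtype=np.float64)
    many = np.asarray(many, dtype=np.float64)
    # Subtract each vector's own mean.
    one_c = one - one.mean()
    many_c = many - many.mean(axis=1, keepdims=True)  # full-size temporary
    # r = (a . b) / (|a| |b|) for the centered vectors a and b.
    num = many_c @ one_c
    den = np.sqrt((many_c ** 2).sum(axis=1) * (one_c ** 2).sum())
    return num / den

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 500))

r_numpy = pearson_one_vs_rows(X[0, :], X)
r_cdist = 1 - cdist(X[0:1, :], X, metric='correlation')[0]
print(np.allclose(r_numpy, r_cdist))
```

Note that the centering step allocates a temporary the same size as many. If memory is tight, the same function could be applied to blocks of rows and the results concatenated.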

Timing:

In [133]: %timeit correlateOneWithMany(X[0, :], X)
10 loops, best of 3: 37.7 ms per loop

In [134]: %timeit 1 - cdist(X[0:1, :], X, metric='correlation')[0]
1000 loops, best of 3: 1.11 ms per loop
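
One difference from the loop version: pearsonr also returns a p-value for each row, while cdist only yields the coefficients. If the p-values are needed, they can probably be recovered from r and the sample length n in one vectorized step. The sketch below assumes the two-sided test described in the notes of the scipy.stats.pearsonr documentation, where r under the null hypothesis follows a beta distribution on [-1, 1]. The helper name pvalues_from_r is hypothetical.

```python
import numpy as np
import scipy.stats as sps
from scipy.spatial.distance import cdist

def pvalues_from_r(r, n):
    """Two-sided p-values for Pearson r from samples of length n (illustrative helper)."""
    # Guard against rounding pushing |r| slightly past 1.
    r = np.clip(np.asarray(r, dtype=np.float64), -1.0, 1.0)
    # Assumed null distribution of r on [-1, 1], per the pearsonr notes.
    dist = sps.beta(n / 2 - 1, n / 2 - 1, loc=-1, scale=2)
    return 2 * dist.cdf(-np.abs(r))

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 50))

r = 1 - cdist(X[0:1, :], X, metric='correlation')[0]
p = pvalues_from_r(r, X.shape[1])

# Spot-check a few rows against scipy.stats.pearsonr.
p_ref = np.array([sps.pearsonr(X[0, :], X[i, :])[1] for i in range(1, 6)])
print(np.allclose(p[1:6], p_ref))
```

Stacking r and p with np.column_stack((r, p)) would give an array with the same layout as the one returned by correlateOneWithMany.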
Warren Weckesser answered Sep 27 '22 00:09