Accelerating one-to-many correlation calculations in Python

Tags: python, numpy, python-2.7, statistics, scipy

I'd like to calculate Pearson's correlation coefficient between a vector and each row of an array in Python (NumPy and/or SciPy may be assumed). Standard correlation-matrix functions are not an option because of the size of the real data arrays and memory constraints. Here's my naive implementation:

import numpy as np
import scipy.stats as sps

np.random.seed(0)

def correlateOneWithMany(one, many):
    """Return Pearson's correlation coef of 'one' with each row of 'many'."""
    pr_arr = np.zeros((many.shape[0], 2), dtype=np.float64)
    pr_arr[:] = np.nan
    for row_num in np.arange(many.shape[0]):
        pr_arr[row_num, :] = sps.pearsonr(one, many[row_num, :])
    return pr_arr

obs, varz = 10 ** 3, 500
X = np.random.uniform(size=(obs, varz))

pr = correlateOneWithMany(X[0, :], X)

%timeit correlateOneWithMany(X[0, :], X)
# 10 loops, best of 3: 38.9 ms per loop

Any thoughts on accelerating this would be greatly appreciated!

asked Jun 01 '16 21:06

dewarrn1

1 Answer

The module scipy.spatial.distance implements the "correlation distance", which is simply one minus the correlation coefficient. You can use the function cdist to compute the one-to-many distances, then recover the correlation coefficients by subtracting the result from 1.
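As a quick sanity check of that relationship on a single pair of vectors, here is a minimal sketch. The arrays `u` and `v` are made-up illustration data, not taken from the question.

```python
import numpy as np
from scipy.spatial.distance import correlation

# Hypothetical example vectors (illustration only).
rs = np.random.RandomState(0)
u = rs.uniform(size=50)
v = rs.uniform(size=50)

r = np.corrcoef(u, v)[0, 1]  # Pearson's correlation coefficient
d = correlation(u, v)        # correlation distance between u and v

# The distance should equal one minus the coefficient, up to rounding.
print(np.isclose(d, 1 - r))
```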

Here's a modified version of your script that includes the calculation of the correlation coefficients using cdist:

import numpy as np
import scipy.stats as sps
from scipy.spatial.distance import cdist

np.random.seed(0)

def correlateOneWithMany(one, many):
    """Return Pearson's correlation coef of 'one' with each row of 'many'."""
    pr_arr = np.zeros((many.shape[0], 2), dtype=np.float64)
    pr_arr[:] = np.nan
    for row_num in np.arange(many.shape[0]):
        pr_arr[row_num, :] = sps.pearsonr(one, many[row_num, :])
    return pr_arr

obs, varz = 10 ** 3, 500
X = np.random.uniform(size=(obs, varz))

pr = correlateOneWithMany(X[0, :], X)

c = 1 - cdist(X[0:1, :], X, metric='correlation')[0]

print(np.allclose(c, pr[:, 0]))

Timing:

In [133]: %timeit correlateOneWithMany(X[0, :], X)
10 loops, best of 3: 37.7 ms per loop

In [134]: %timeit 1 - cdist(X[0:1, :], X, metric='correlation')[0]
1000 loops, best of 3: 1.11 ms per loop
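Note that cdist returns only the coefficients, whereas the original function also returned the p-values from pearsonr. If those are needed, one possible approach (a sketch, not benchmarked here) is to compute r with plain NumPy and derive the two-sided p-value from the Student's t distribution with n - 2 degrees of freedom. The function name `correlate_one_with_many_np` is hypothetical, and the sketch assumes n > 2 and no constant rows.

```python
import numpy as np
import scipy.stats as sps

def correlate_one_with_many_np(one, many):
    """Return an (n_rows, 2) array: Pearson's r and its two-sided p-value."""
    one = np.asarray(one, dtype=np.float64)
    many = np.asarray(many, dtype=np.float64)
    n = one.shape[0]  # number of paired samples; assumed > 2

    # Center the vector and every row (assumes no row is constant).
    one_c = one - one.mean()
    many_c = many - many.mean(axis=1, keepdims=True)

    # r = <a, b> / (|a| * |b|) for centered a and b.
    norms = np.sqrt((many_c ** 2).sum(axis=1)) * np.sqrt((one_c ** 2).sum())
    r = np.clip(many_c.dot(one_c) / norms, -1.0, 1.0)

    # t statistic for r; r == +/-1 divides by zero, giving inf and p == 0.
    with np.errstate(divide='ignore'):
        t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    p = 2.0 * sps.t.sf(np.abs(t), n - 2)
    return np.column_stack((r, p))
```

The result can be checked against the loop version with `np.allclose(correlate_one_with_many_np(X[0, :], X), pr)`. Centering creates one temporary array the size of `many`, so with very large inputs it may be necessary to process the rows in blocks.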

answered Sep 27 '22 00:09

Warren Weckesser
