I'm partial to using pandas' built-in corr method for DataFrames. However, I'm trying to calculate the correlation matrix of a DataFrame with 45,000 columns, and then repeat this 250 times. The calculation is crushing my RAM (16 GB, MacBook Pro). I'm gathering statistics on the columns of the resulting correlation matrix, so I need each column's correlation with every other column to calculate those statistics. My solution is to calculate the correlation of a subset of columns with every other column, but I need an efficient way to do this.
Consider:
import pandas as pd
import numpy as np
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(6, 4), columns=list('ABCD'))
df
I want to calculate the correlations for just ['A', 'B']
corrs = df.corr()[['A', 'B']]
corrs
I'll finish it off by calculating the mean or some other stat.
I can't use the code I used to create the example because, at scale, I don't have the memory for it. The calculation must use an amount of memory proportional to the number of columns being correlated against everything else, not to the full correlation matrix.
I'm looking for the most performant solution at scale. I have a solution, but I'm looking for other ideas to ensure I'm getting the best. Any answer provided that returns the correct answer as shown in the demonstration and satisfies the memory constraint will be upvoted by me (and I'd encourage upvoting amongst each other as well).
Below is my code:
def corr(df, k=0, l=10):
    # demean once, then correlate columns k:l against all columns
    d = df.values - df.values.mean(0)
    s = d.std(0, keepdims=True)
    return pd.DataFrame(d.T.dot(d[:, k:l]) / s.T.dot(s[:, k:l]) / d.shape[0],
                        df.columns, df.columns[k:l])
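To confirm this agrees with pandas on the small example, here is a self-contained check (it re-defines the function so the snippet runs standalone):

```python
import numpy as np
import pandas as pd

def corr(df, k=0, l=10):
    d = df.values - df.values.mean(0)
    s = d.std(0, keepdims=True)
    return pd.DataFrame(d.T.dot(d[:, k:l]) / s.T.dot(s[:, k:l]) / d.shape[0],
                        df.columns, df.columns[k:l])

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(6, 4), columns=list('ABCD'))

# the subset of columns agrees with the full pandas correlation matrix
assert np.allclose(corr(df, 0, 2), df.corr()[['A', 'B']])
```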
Using dot products to compute the correlation (as in your example) seems like a good approach. I'll describe two improvements, then code implementing them.
We can pull the means out of the dot product, to avoid having to subtract them from every value (similar to how you pulled the standard deviations out of the dot product, which we'll also do).
Let x, y be vectors with n elements, and let a, b be scalars. Let <x, y> denote the dot product between x and y. The correlation between x and y can be expressed using the dot product:
<(x - mean(x))/std(x), (y - mean(y))/std(y)> / n
To pull the standard deviations out of the dot product, we can use the following identity (as you did above):
<ax, by> = a*b*<x, y>
To pull the means out of the dot product, we can derive another identity:
<x+a, y+b> = <x,y> + a*sum(y) + b*sum(x) + a*b*n
In the case where a = -mean(x) and b = -mean(y), this simplifies to:
<x-mean(x), y-mean(y)> = <x, y> - sum(x)*sum(y)/n
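A quick numerical check of this identity with random vectors (a sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x, y = rng.random(n), rng.random(n)

# left side: dot product of the demeaned vectors
lhs = np.dot(x - x.mean(), y - y.mean())
# right side: raw dot product with the mean correction pulled out
rhs = np.dot(x, y) - x.sum() * y.sum() / n
assert np.isclose(lhs, rhs)
```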
Using these identities, the correlation between x and y is equivalent to:
(<x, y> - sum(x)*sum(y)/n) / (std(x)*std(y)*n)
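This final formula can be verified against NumPy's built-in correlation for a pair of random vectors (note np.std defaults to the population standard deviation, which is what the formula assumes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x, y = rng.random(n), rng.random(n)

# (<x, y> - sum(x)*sum(y)/n) / (std(x)*std(y)*n)
r = (np.dot(x, y) - x.sum() * y.sum() / n) / (x.std() * y.std() * n)
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```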
In the function below, this will be expressed using matrix multiplication and outer products to handle multiple variables simultaneously (as in your example).
We can pre-compute the sums and standard deviations, to avoid re-computing them for all columns every time the function is called.
Putting the two improvements together, we have the following (I don't speak pandas, so it's in numpy):
def corr_cols(x, xsum, xstd, lo, hi):
    n = x.shape[0]
    return (
        (np.dot(x.T, x[:, lo:hi]) - np.outer(xsum, xsum[lo:hi]) / n)
        / (np.outer(xstd, xstd[lo:hi]) * n)
    )
# fake data w/ 10 points, 5 dimensions
x = np.random.rand(10, 5)
# precompute sums and standard deviations along each dimension
xsum = np.sum(x, 0)
xstd = np.std(x, 0)
# calculate columns of correlation matrix for dimensions 1 thru 3
r = corr_cols(x, xsum, xstd, 1, 4)
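To verify the result matches NumPy's built-in routine, here is the same computation as a self-contained check:

```python
import numpy as np

def corr_cols(x, xsum, xstd, lo, hi):
    n = x.shape[0]
    return (
        (np.dot(x.T, x[:, lo:hi]) - np.outer(xsum, xsum[lo:hi]) / n)
        / (np.outer(xstd, xstd[lo:hi]) * n)
    )

x = np.random.rand(10, 5)
xsum = np.sum(x, 0)
xstd = np.std(x, 0)
r = corr_cols(x, xsum, xstd, 1, 4)

# r holds columns 1 thru 3 of the full 5x5 correlation matrix
assert np.allclose(r, np.corrcoef(x, rowvar=False)[:, 1:4])
```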
Pre-computing and storing the sums and standard deviations can be hidden inside a closure, to give a nicer interface and keep the main code cleaner. Functionally, the operations are equivalent to the previous code.
def col_correlator(x):
    n = x.shape[0]
    xsum = np.sum(x, 0)
    xstd = np.std(x, 0)
    return lambda lo, hi: (
        (np.dot(x.T, x[:, lo:hi]) - np.outer(xsum, xsum[lo:hi]) / n)
        / (np.outer(xstd, xstd[lo:hi]) * n)
    )
# construct function to compute columns of correlation matrix
cc = col_correlator(x)
# compute columns of correlation matrix for dimensions 1 thru 3
r = cc(1, 4)
I wanted to put my edit in this post to further encourage upvoting of this answer.
This is the code I implemented utilizing this advice. This solution translates back and forth between pandas and numpy.
def corr_closure(df):
    d = df.values
    sums = d.sum(0, keepdims=True)
    stds = d.std(0, keepdims=True)
    n = d.shape[0]

    def corr(k=0, l=10):
        d2 = d.T.dot(d[:, k:l])
        sums2 = sums.T.dot(sums[:, k:l])
        stds2 = stds.T.dot(stds[:, k:l])
        return pd.DataFrame((d2 - sums2 / n) / stds2 / n,
                            df.columns, df.columns[k:l])

    return corr
Use case:
corr = corr_closure(df)
corr(0, 2)
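As a sanity check, the closure reproduces pandas on the small example from the question (a self-contained sketch):

```python
import numpy as np
import pandas as pd

def corr_closure(df):
    d = df.values
    sums = d.sum(0, keepdims=True)
    stds = d.std(0, keepdims=True)
    n = d.shape[0]

    def corr(k=0, l=10):
        d2 = d.T.dot(d[:, k:l])
        sums2 = sums.T.dot(sums[:, k:l])
        stds2 = stds.T.dot(stds[:, k:l])
        return pd.DataFrame((d2 - sums2 / n) / stds2 / n,
                            df.columns, df.columns[k:l])

    return corr

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(6, 4), columns=list('ABCD'))
corr = corr_closure(df)

# matches pandas' full-matrix result on columns A and B
assert np.allclose(corr(0, 2), df.corr()[['A', 'B']])
```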
Credit goes to @user20160, @piRsquared.
I had a very similar issue. I was trying to compute just one block of the matrix: the correlations between one group of columns and another.
I modified the code a bit so that it takes four parameters, defining the two groups of vectors:
def col_correlator(x):
    n = x.shape[0]
    xsum = np.sum(x, 0)
    xstd = np.std(x, 0)
    return lambda lo_c, hi_c, lo_r, hi_r: (
        (np.dot(x[:, lo_r:hi_r].T, x[:, lo_c:hi_c])
         - np.outer(xsum[lo_r:hi_r], xsum[lo_c:hi_c]) / n)
        / (np.outer(xstd[lo_r:hi_r], xstd[lo_c:hi_c]) * n)
    )
# construct function to compute columns of correlation matrix
cc = col_correlator(x)
# correlate columns 0..n-1 (rows of the result) with columns n..m-1
r = cc(n, m, 0, n)
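A concrete usage sketch with explicit bounds (the column ranges here are illustrative), checked against NumPy's full correlation matrix:

```python
import numpy as np

def col_correlator(x):
    n = x.shape[0]
    xsum = np.sum(x, 0)
    xstd = np.std(x, 0)
    return lambda lo_c, hi_c, lo_r, hi_r: (
        (np.dot(x[:, lo_r:hi_r].T, x[:, lo_c:hi_c])
         - np.outer(xsum[lo_r:hi_r], xsum[lo_c:hi_c]) / n)
        / (np.outer(xstd[lo_r:hi_r], xstd[lo_c:hi_c]) * n)
    )

x = np.random.rand(10, 6)
cc = col_correlator(x)

# correlations of columns 0..2 (rows of the result) with columns 3..5
r = cc(3, 6, 0, 3)
assert np.allclose(r, np.corrcoef(x, rowvar=False)[0:3, 3:6])
```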