I'm partial to using pandas' built-in corr method for DataFrames. However, I'm trying to calculate the correlation matrix of a DataFrame with 45,000 columns, and then repeat this 250 times. The calculation is crushing my RAM (16 GB, MacBook Pro). I'm gathering statistics on the columns of the resulting correlation matrix, so I need each column's correlation with every other column to calculate those statistics. My solution is to calculate the correlation of a subset of columns with every other column, but I need an efficient way to do this.
Consider:
import pandas as pd
import numpy as np
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(6, 4), columns=list('ABCD'))
df
I want to calculate the correlations for just ['A', 'B']
corrs = df.corr()[['A', 'B']]
corrs
I'll finish it off by calculating the mean or some other stat.
I can't use the code I used to create the example because, at scale, I don't have the memory for it. The calculation must use an amount of memory proportional to the number of columns being correlated against everything else, not to the full correlation matrix.
I'm looking for the most performant solution at scale. I have a solution, but I'm looking for other ideas to ensure I'm getting the best. Any answer provided that returns the correct answer as shown in the demonstration and satisfies the memory constraint will be upvoted by me (and I'd encourage upvoting amongst each other as well).
Below is my code:
def corr(df, k=0, l=10):
    # demean once, then correlate columns k:l against all columns
    d = df.values - df.values.mean(0)
    s = d.std(0, keepdims=True)
    return pd.DataFrame(d.T.dot(d[:, k:l]) / s.T.dot(s[:, k:l]) / d.shape[0],
                        df.columns, df.columns[k:l])
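To confirm this agrees with pandas on the small example, here is a self-contained check (it re-defines the function so the snippet runs standalone):

```python
import numpy as np
import pandas as pd

def corr(df, k=0, l=10):
    d = df.values - df.values.mean(0)
    s = d.std(0, keepdims=True)
    return pd.DataFrame(d.T.dot(d[:, k:l]) / s.T.dot(s[:, k:l]) / d.shape[0],
                        df.columns, df.columns[k:l])

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(6, 4), columns=list('ABCD'))

# the subset of columns agrees with the full pandas correlation matrix
assert np.allclose(corr(df, 0, 2), df.corr()[['A', 'B']])
```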
Using dot products to compute the correlation (as in your example) seems like a good approach. I'll describe two improvements, then code implementing them.
We can pull the means out of the dot product, to avoid having to subtract them from every value (similar to how you pulled the standard deviations out of the dot product, which we'll also do).
Let x, y be vectors with n elements, and let a, b be scalars. Let <x, y> denote the dot product between x and y. The correlation between x and y can be expressed using the dot product:
<(x - mean(x))/std(x), (y - mean(y))/std(y)> / n
To pull the standard deviations out of the dot product, we can use the following identity (as you did above):
<ax, by> = a*b*<x, y>
To pull the means out of the dot product, we can derive another identity:
<x+a, y+b> = <x,y> + a*sum(y) + b*sum(x) + a*b*n
In the case where a = -mean(x) and b = -mean(y), this simplifies to:
<x-mean(x), y-mean(y)> = <x, y> - sum(x)*sum(y)/n
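A quick numerical check of this identity with random vectors (a sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x, y = rng.random(n), rng.random(n)

# left side: dot product of the demeaned vectors
lhs = np.dot(x - x.mean(), y - y.mean())
# right side: raw dot product with the mean correction pulled out
rhs = np.dot(x, y) - x.sum() * y.sum() / n
assert np.isclose(lhs, rhs)
```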
Using these identities, the correlation between x and y is equivalent to:
(<x, y> - sum(x)*sum(y)/n) / (std(x)*std(y)*n)
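This final formula can be verified against NumPy's built-in correlation for a pair of random vectors (note np.std defaults to the population standard deviation, which is what the formula assumes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x, y = rng.random(n), rng.random(n)

# (<x, y> - sum(x)*sum(y)/n) / (std(x)*std(y)*n)
r = (np.dot(x, y) - x.sum() * y.sum() / n) / (x.std() * y.std() * n)
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```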
In the function below, this will be expressed using matrix multiplication and outer products to handle multiple variables simultaneously (as in your example).
We can pre-compute the sums and standard deviations, to avoid re-computing them for all columns every time the function is called.
Putting the two improvements together, we have the following (I don't speak pandas, so it's in numpy):
def corr_cols(x, xsum, xstd, lo, hi):
    n = x.shape[0]
    return (
        (np.dot(x.T, x[:, lo:hi]) - np.outer(xsum, xsum[lo:hi]) / n)
        / (np.outer(xstd, xstd[lo:hi]) * n)
    )
# fake data w/ 10 points, 5 dimensions
x = np.random.rand(10, 5)
# precompute sums and standard deviations along each dimension
xsum = np.sum(x, 0)
xstd = np.std(x, 0)
# calculate columns of correlation matrix for dimensions 1 thru 3
r = corr_cols(x, xsum, xstd, 1, 4)
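To verify the result matches NumPy's built-in routine, here is the same computation as a self-contained check:

```python
import numpy as np

def corr_cols(x, xsum, xstd, lo, hi):
    n = x.shape[0]
    return (
        (np.dot(x.T, x[:, lo:hi]) - np.outer(xsum, xsum[lo:hi]) / n)
        / (np.outer(xstd, xstd[lo:hi]) * n)
    )

x = np.random.rand(10, 5)
xsum = np.sum(x, 0)
xstd = np.std(x, 0)
r = corr_cols(x, xsum, xstd, 1, 4)

# r holds columns 1 thru 3 of the full 5x5 correlation matrix
assert np.allclose(r, np.corrcoef(x, rowvar=False)[:, 1:4])
```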
Pre-computing and storing the sums and standard deviations can be hidden inside a closure, to give a nicer interface and keep the main code cleaner. Functionally, the operations are equivalent to the previous code.
def col_correlator(x):
    n = x.shape[0]
    xsum = np.sum(x, 0)
    xstd = np.std(x, 0)
    return lambda lo, hi: (
        (np.dot(x.T, x[:, lo:hi]) - np.outer(xsum, xsum[lo:hi]) / n)
        / (np.outer(xstd, xstd[lo:hi]) * n)
    )
# construct function to compute columns of correlation matrix
cc = col_correlator(x)
# compute columns of correlation matrix for dimensions 1 thru 3
r = cc(1, 4)
I wanted to put my edit in this post to further encourage upvoting of this answer.
This is the code I implemented utilizing this advice. This solution translates back and forth between pandas and numpy.
def corr_closure(df):
    d = df.values
    sums = d.sum(0, keepdims=True)
    stds = d.std(0, keepdims=True)
    n = d.shape[0]

    def corr(k=0, l=10):
        d2 = d.T.dot(d[:, k:l])
        sums2 = sums.T.dot(sums[:, k:l])
        stds2 = stds.T.dot(stds[:, k:l])
        return pd.DataFrame((d2 - sums2 / n) / stds2 / n,
                            df.columns, df.columns[k:l])

    return corr
Use case:
corr = corr_closure(df)
corr(0, 2)
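As a sanity check, the closure reproduces pandas on the small example from the question (a self-contained sketch):

```python
import numpy as np
import pandas as pd

def corr_closure(df):
    d = df.values
    sums = d.sum(0, keepdims=True)
    stds = d.std(0, keepdims=True)
    n = d.shape[0]

    def corr(k=0, l=10):
        d2 = d.T.dot(d[:, k:l])
        sums2 = sums.T.dot(sums[:, k:l])
        stds2 = stds.T.dot(stds[:, k:l])
        return pd.DataFrame((d2 - sums2 / n) / stds2 / n,
                            df.columns, df.columns[k:l])

    return corr

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(6, 4), columns=list('ABCD'))
corr = corr_closure(df)

# matches pandas' full-matrix result on columns A and B
assert np.allclose(corr(0, 2), df.corr()[['A', 'B']])
```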
Credit goes to @user20160, @piRsquared.
I had a very similar issue. I was trying to compute just one block of the matrix: the correlations between one group of columns and another.
I modified the code a bit so that it takes four parameters, defining the two groups of vectors:
def col_correlator(x):
    n = x.shape[0]
    xsum = np.sum(x, 0)
    xstd = np.std(x, 0)
    return lambda lo_c, hi_c, lo_r, hi_r: (
        (np.dot(x[:, lo_r:hi_r].T, x[:, lo_c:hi_c])
         - np.outer(xsum[lo_r:hi_r], xsum[lo_c:hi_c]) / n)
        / (np.outer(xstd[lo_r:hi_r], xstd[lo_c:hi_c]) * n)
    )
# construct function to compute columns of correlation matrix
cc = col_correlator(x)
# correlate columns 0..n-1 (rows of the result) with columns n..m-1
r = cc(n, m, 0, n)
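A concrete usage sketch with explicit bounds (the column ranges here are illustrative), checked against NumPy's full correlation matrix:

```python
import numpy as np

def col_correlator(x):
    n = x.shape[0]
    xsum = np.sum(x, 0)
    xstd = np.std(x, 0)
    return lambda lo_c, hi_c, lo_r, hi_r: (
        (np.dot(x[:, lo_r:hi_r].T, x[:, lo_c:hi_c])
         - np.outer(xsum[lo_r:hi_r], xsum[lo_c:hi_c]) / n)
        / (np.outer(xstd[lo_r:hi_r], xstd[lo_c:hi_c]) * n)
    )

x = np.random.rand(10, 6)
cc = col_correlator(x)

# correlations of columns 0..2 (rows of the result) with columns 3..5
r = cc(3, 6, 0, 3)
assert np.allclose(r, np.corrcoef(x, rowvar=False)[0:3, 3:6])
```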