Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding the correlation matrix

I have a matrix which is fairly large (around 50K rows), and I want to print the correlation coefficient between each row in the matrix. I have written Python code like this:

for i in xrange(rows): # rows are the number of rows in the matrix. 
    for j in xrange(i, rows):
        r = scipy.stats.pearsonr(data[i,:], data[j,:])
        print r  

Please note that I am making use of the pearsonr function available from the scipy module (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html).

My question is: Is there a quicker way of doing this? Is there some matrix partition technique that I can use?

Thanks!

like image 296
batman08 Avatar asked Aug 09 '10 05:08

batman08


2 Answers

Have you tried just using numpy.corrcoef? Seeing as how you're not using the p-values, it should do exactly what you want, with as little fuss as possible. (Unless I'm mis-remembering exactly what pearson's R is, which is quite possible.)

Just quickly checking the results on random data, it returns exactly the same thing as @Justin Peel's code above and runs ~100x faster.

For example, testing things with 1000 rows and 10 columns of random data...:

import numpy as np
import scipy as sp
import scipy.stats

def main():
    data = np.random.random((1000, 10))
    x = corrcoef_test(data)
    y = justin_peel_test(data)
    print 'Maximum difference between the two results:', np.abs((x-y)).max()
    return data

def corrcoef_test(data):
    """Just using numpy's built-in function"""
    return np.corrcoef(data)

def justin_peel_test(data):
    """Justin Peel's suggestion above"""
    rows = data.shape[0]

    r = np.zeros((rows,rows))
    ms = data.mean(axis=1)

    datam = np.zeros_like(data)
    for i in xrange(rows):
        datam[i] = data[i] - ms[i]
    datass = sp.stats.ss(datam,axis=1)
    for i in xrange(rows):
        for j in xrange(i,rows):
            r_num = np.add.reduce(datam[i]*datam[j])
            r_den = np.sqrt(datass[i]*datass[j])
            r[i,j] = min((r_num / r_den), 1.0)
            r[j,i] = r[i,j]
    return r

data = main()

Yields a maximum absolute difference of ~3.3e-16 between the two results

And timings:

In [44]: %timeit corrcoef_test(data)
10 loops, best of 3: 71.7 ms per loop

In [45]: %timeit justin_peel_test(data)
1 loops, best of 3: 6.5 s per loop

numpy.corrcoef should do just what you want, and it's a lot faster.

like image 103
Joe Kington Avatar answered Nov 19 '22 23:11

Joe Kington


New Solution

After looking at Joe Kington's answer, I decided to look into the corrcoef() code and was inspired by it to do the following implementation.

ms = data.mean(axis=1)[(slice(None,None,None),None)]
datam = data - ms
datass = np.sqrt(scipy.stats.ss(datam,axis=1))
for i in xrange(rows):
    temp = np.dot(datam[i:],datam[i].T)
    rs = temp / (datass[i:]*datass[i])

Each loop through generates the Pearson coefficients between row i and rows i through to the last row. It is very fast. It is at least 1.5x as fast as using corrcoef() alone because it doesn't redundantly calculate the coefficients and a few other things. It will also be faster and won't give you the memory problems with a 50,000 row matrix because then you can choose to either store each set of r's or process them before generating another set. Without storing any of the r's long term, I was able to get the above code to run on 50,000 x 10 set of randomly generated data in under a minute on my fairly new laptop.

Old Solution

First, I wouldn't recommend printing out the r's to the screen. For 100 rows (10 columns), this is a difference of 19.79 seconds with printing vs. 0.301 seconds without using your code. Just store the r's and use them later if you would like, or do some processing on them as you go along like looking for some of the largest r's.

Second, you can get some savings by not redundantly calculating some quantities. The Pearson coefficient is calculated in scipy using some quantities that you can precalculate rather than calculating every time that a row is used. Also, you aren't using the p-value (which is also returned by pearsonr() so let's scratch that too. Using the below code:

r = np.zeros((rows,rows))
ms = data.mean(axis=1)

datam = np.zeros_like(data)
for i in xrange(rows):
    datam[i] = data[i] - ms[i]
datass = scipy.stats.ss(datam,axis=1)
for i in xrange(rows):
    for j in xrange(i,rows):
        r_num = np.add.reduce(datam[i]*datam[j])
        r_den = np.sqrt(datass[i]*datass[j])
        r[i,j] = min((r_num / r_den), 1.0)

I get a speed-up of about 4.8x over the straight scipy code when I've removed the p-value stuff - 8.8x if I leave the p-value stuff in there (I used 10 columns with hundreds of rows). I also checked that it does give the same results. This isn't a really huge improvement, but it might help.

Ultimately, you are stuck with the problem that you are computing (50000)*(50001)/2 = 1,250,025,000 Pearson coefficients (if I'm counting correctly). That's a lot. By the way, there's really no need to compute each row's Pearson coefficient with itself (it will equal 1), but that only saves you from computing 50,000 Pearson coefficients. With the above code, I expect that it would take about 4 1/4 hours to do your computation if you have 10 columns to your data based on my results on smaller datasets.

You can get some improvement by taking the above code into Cython or something similar. I expect that you'll maybe get up to a 10x improvement over straight Scipy if you're lucky. Also, as suggested by pyInTheSky, you can do some multiprocessing.

like image 33
Justin Peel Avatar answered Nov 19 '22 21:11

Justin Peel