
Fast way to see common observation counts for Python Pandas correlation matrix entries

Suppose I have a pandas.DataFrame called df. The columns of df represent different individuals and the index axis represents time, so the (i,j) entry is individual j's observation for time period i, and we can assume all data are float type possibly with NaN values.

In my case, I have about 14,000 columns and a few hundred rows.

df.corr() will give me back the 14,000-by-14,000 correlation matrix, and its time performance is fine for my application.

But I would also like to know, for each pair of individuals (j_1, j_2), how many non-null observations went into the correlation calculation, so I can isolate correlation cells that suffer from poor data coverage.

The best I've been able to come up with is the following:

import pandas

# 1 where an observation is present, 0 where it is NaN
not_null_locations = pandas.notnull(df).values.astype(int)
common_obs = pandas.DataFrame(not_null_locations.T.dot(not_null_locations),
                              columns=df.columns, index=df.columns)

The memory footprint and speed of this begin to be a bit problematic.

Is there any faster way to get at the common observations with pandas?

ely asked Aug 14 '13 14:08


2 Answers

You can do this, but you would need to Cythonize it (otherwise it is much slower); the memory footprint should be better, though. Note that this counts the rows where both columns are NaN, whereas yours counts the rows where both are valid, but one is easily converted to the other.

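import numpy as np
import pandas as pd

n_cols = len(df.columns)
results = np.zeros((n_cols, n_cols))
mask = pd.isnull(df).values  # boolean array, True where df is NaN
for i in range(n_cols):
    for j in range(n_cols):
        # positional access, so this works whatever the column labels are
        results[j, i] = (mask[:, i] & mask[:, j]).sum()
results = pd.DataFrame(results, index=df.columns, columns=df.columns)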
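The conversion mentioned above is not quite a one-liner, so here is a minimal sketch of it using inclusion-exclusion: the number of rows where both columns are valid equals the total row count minus the rows where either column is NaN. The helper name `valid_pair_counts` and the toy frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def valid_pair_counts(null_both, null_per_col, n_rows):
    """Convert both-NaN pair counts into both-valid pair counts.

    Hypothetical helper: null_both[i, j] is the number of rows where columns
    i and j are both NaN; null_per_col[i] is the number of NaNs in column i.
    """
    null_both = np.asarray(null_both)
    null_per_col = np.asarray(null_per_col)
    # rows where either column is NaN = null_i + null_j - null_both
    either_null = null_per_col[:, None] + null_per_col[None, :] - null_both
    return n_rows - either_null

# Toy frame, made up for illustration.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],
                    "b": [np.nan, np.nan, 1.0, 2.0],
                    "c": [1.0, 2.0, 3.0, 4.0]})
null_mask = toy.isnull().values.astype(int)
null_both = null_mask.T.dot(null_mask)  # both-NaN counts per column pair
valid_both = valid_pair_counts(null_both, null_mask.sum(axis=0), len(toy))
```

If only the valid counts are needed, it is simpler to build the mask with pd.notnull instead of pd.isnull and skip the conversion.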
Jeff answered Sep 24 '22 06:09


You can actually make @Jeff's answer a little faster by only iterating j up to (but not including) i + 1 in the nested loop: because the result, like the correlation matrix, is symmetric, you can assign results[i, j] and results[j, i] at the same time. You can also move the column lookup for i outside of the nested loop, which is a tiny optimization but might yield some performance gains for very large frames.

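import numpy as np
import pandas as pd

n_cols = len(df.columns)
results = np.zeros((n_cols, n_cols))
mask = pd.isnull(df).values  # boolean array, True where df is NaN
for i in range(n_cols):
    maski = mask[:, i]  # hoisted out of the inner loop
    for j in range(i + 1):
        # fill both triangles at once, since the matrix is symmetric
        results[i, j] = results[j, i] = (maski & mask[:, j]).sum()
results = pd.DataFrame(results, index=df.columns, columns=df.columns)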
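If staying in pure NumPy is preferable to Cythonizing the loops, one option is to keep the matrix-product approach from the question but run it on a float32 mask. This is only a sketch under two assumptions: that NumPy is linked against a BLAS library (it typically hands floating-point matrix products to BLAS, while integer products take a slower generic path), and that the row count is far below 2**24, so every count is represented exactly in float32. The function name `common_obs_counts` and the toy frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def common_obs_counts(df):
    """Count, for each column pair, the rows where both values are non-null.

    Hypothetical helper; assumes fewer than 2**24 rows so that float32
    holds every count exactly.
    """
    # 1.0 where an observation is present, 0.0 where it is NaN
    valid = pd.notnull(df).values.astype(np.float32)
    counts = valid.T.dot(valid)  # float32 matrix product
    # The cast makes a temporary copy; skip it if memory is tight.
    return pd.DataFrame(counts.astype(np.int32),
                        index=df.columns, columns=df.columns)

# Toy frame, made up for illustration.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],
                    "b": [np.nan, np.nan, 1.0, 2.0],
                    "c": [1.0, 2.0, 3.0, 4.0]})
common_obs = common_obs_counts(toy)
```

For 14,000 columns the float32 result takes roughly 14,000 * 14,000 * 4 bytes, about 784 MB, which is half the size of a 64-bit integer result; whether it is actually faster depends on the BLAS build, so it is worth timing on real data.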
Phillip Cloud answered Sep 25 '22 06:09