Fast way to see common observation counts for Python Pandas correlation matrix entries

Question

Suppose I have a pandas.DataFrame called df. The columns of df represent different individuals and the index axis represents time, so the (i,j) entry is individual j's observation for time period i, and we can assume all data are float type possibly with NaN values.

In my case, I have about 14,000 columns and a few hundred rows.

df.corr() gives me back the 14,000-by-14,000 correlation matrix, and its time performance is fine for my application.

But I would also like to know, for each pair of individuals (j_1, j_2), how many non-null observations went into the correlation calculation, so I can isolate correlation cells that suffer from poor data coverage.
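As a side note on the coverage concern: DataFrame.corr accepts a min_periods argument, which blanks out pairs that have too few overlapping observations. It does not return the counts themselves, so it only partly addresses the question. A minimal sketch, assuming a toy frame and a threshold of 3 chosen purely for illustration:

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): column "c" overlaps with the others on just 2 rows.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, np.nan, 6.0, 8.0, 11.0],
    "c": [np.nan, np.nan, np.nan, 1.0, 2.0],
})

# Pairs with fewer than 3 rows where both columns are non-null come back as NaN.
corr = df.corr(min_periods=3)
```

Here the (a, b) entry is computed from 4 shared rows, while (a, c) and (b, c) share only 2 rows and are masked.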

The best I've been able to come up with is the following:

not_null_locations = pandas.notnull(df).values.astype(int)
common_obs = pandas.DataFrame(not_null_locations.T.dot(not_null_locations),
                              columns=df.columns, index=df.columns)

The memory footprint and speed of this begin to be a bit problematic.

Is there any faster way to get at the common observations with pandas?

Jeff · Accepted Answer

You can do this, but it would need to be cythonized (otherwise it is much slower); the memory footprint should be better, however. Note that this gives the number of rows where both columns are NaN, whereas yours gives the number of valid observations, but the two are easily convertible.

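import numpy as np
import pandas as pd

l = len(df.columns)
results = np.zeros((l, l))
mask = pd.isnull(df)
# For every pair of columns, count the rows where both are NaN.
for i, ac in enumerate(df):
    for j, bc in enumerate(df):
        # Index the mask by column label (ac, bc), not by position.
        results[j, i] = (mask[ac] & mask[bc]).sum()
results = pd.DataFrame(results, index=df.columns, columns=df.columns)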
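The conversion from joint-NaN counts to joint-valid counts can be done by inclusion-exclusion: the rows where either column is NaN number null_i + null_j - both_null, and the valid pairs are whatever remains. A small sketch of that arithmetic, assuming a toy frame; the joint-NaN matrix is built with a matrix product here only to keep the example short, and the names both_null and common_obs are illustrative:

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only).
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 6.0, 8.0],
    "c": [1.0, 2.0, np.nan, 4.0],
})

mask = df.isnull()
n_rows = len(df)
null_per_col = mask.sum().to_numpy()  # number of NaNs in each column

# Joint-NaN counts: entry (i, j) is the number of rows where both columns are NaN.
m = mask.to_numpy().astype(np.int64)
both_null = m.T @ m

# Rows where at least one of the two columns is NaN: null_i + null_j - both_null.
# Rows where both are valid: n_rows minus that quantity.
both_valid = n_rows - null_per_col[:, None] - null_per_col[None, :] + both_null
common_obs = pd.DataFrame(both_valid, index=df.columns, columns=df.columns)
```

For this frame, columns a and b share 2 valid rows, a and c share 2, and b and c share 1.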

Phillip Cloud · Answer

You can actually make @Jeff's answer a little faster by only iterating up to (but not including) i + 1 in the nested loop; because the counts, like the correlations, are symmetric, you can assign both (i, j) and (j, i) at the same time. You can also hoist the lookup of column i's mask out of the inner loop, which is a tiny optimization but might yield some performance gains for very large frames.

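import numpy as np
import pandas as pd

l = len(df.columns)
results = np.zeros((l, l))
mask = pd.isnull(df).values  # plain boolean array, indexed by position
for i in range(l):
    maski = mask[:, i]
    for j in range(i + 1):
        # The count is symmetric, so fill both triangles at once.
        results[i, j] = results[j, i] = (maski & mask[:, j]).sum()
results = pd.DataFrame(results, index=df.columns, columns=df.columns)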
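One way to sanity-check the symmetric loop on a small frame is to compare it with a matrix product of the mask. The sketch below casts the mask to float32, on the assumption that NumPy hands floating-point products to BLAS while integer products take a slower path, and that float32 halves the footprint of the result relative to 64-bit integers. Counts of a few hundred are represented exactly in float32. The random frame and the names loop_counts and matmul_counts are illustrative:

```python
import numpy as np
import pandas as pd

# Small random frame with roughly 30% NaNs (illustrative only).
rng = np.random.default_rng(0)
values = rng.normal(size=(50, 8))
values[rng.random(values.shape) < 0.3] = np.nan
df = pd.DataFrame(values, columns=[f"s{k}" for k in range(8)])

mask = df.isnull().to_numpy()
n_cols = mask.shape[1]

# Symmetric loop: visit only the lower triangle and mirror each count.
loop_counts = np.zeros((n_cols, n_cols))
for i in range(n_cols):
    mask_i = mask[:, i]
    for j in range(i + 1):
        loop_counts[i, j] = loop_counts[j, i] = (mask_i & mask[:, j]).sum()

# Same joint-NaN counts via a float32 matrix product.
m32 = mask.astype(np.float32)
matmul_counts = m32.T @ m32

same = np.array_equal(loop_counts, matmul_counts)
```

Whether the float32 product is actually faster or lighter than the loop at 14,000 columns depends on the machine and the BLAS build, so it is worth timing on real data.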

Tags: python, pandas, missing-data, numpy

ely
