Suppose I have a pandas.DataFrame called df. The columns of df represent different individuals and the index axis represents time, so the (i, j) entry is individual j's observation for time period i. We can assume all data are float type, possibly with NaN values.
In my case, I have about 14,000 columns and a few hundred rows.
df.corr() will give me back the 14,000-by-14,000 correlation matrix, and its time performance is fine for my application.
But I would also like to know, for each pair of individuals (j_1, j_2), how many non-null observations went into the correlation calculation, so I can isolate correlation cells that suffer from poor data coverage.
The best I've been able to come up with is the following:
not_null_locations = pandas.notnull(df).values.astype(int)
common_obs = pandas.DataFrame(not_null_locations.T.dot(not_null_locations),
                              columns=df.columns, index=df.columns)
The memory footprint and speed of this approach are starting to become problematic.
Is there any faster way to get at the common observations with pandas?
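For context, a rough sketch of how I intend to use the counts once I have them, where common_obs is the frame computed above and the cutoff of 30 is just an illustrative placeholder:

# Hide correlation cells that are backed by too few common observations.
min_obs = 30                                    # illustrative cutoff only
corr = df.corr()
sparse_corr = corr.where(common_obs >= min_obs) # poorly covered cells become NaN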
You can do this, but you would need to cythonize it (otherwise it is much slower); however, the memory footprint should be better. (This gives the number of NaN observations, whereas yours gives the number of valid observations, but the two are easily convertible; see the sketch after the code.)
import numpy as np
import pandas as pd

l = len(df.columns)
results = np.zeros((l, l))
mask = pd.isnull(df)
for i, ac in enumerate(df):          # ac and bc are column labels
    for j, bc in enumerate(df):
        results[j, i] = (mask[ac] & mask[bc]).sum()
results = pd.DataFrame(results, index=df.columns, columns=df.columns)
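The answer notes that these counts (rows where both columns are NaN) are easily convertible into the number of valid common observations; a minimal sketch of that conversion via inclusion-exclusion, reusing df, mask, and results from the snippet above:

# results[j, i] counts rows where columns i and j are BOTH NaN; the number of
# rows where both are observed is then
#   n_rows - NaNs(i) - NaNs(j) + both_NaN(i, j)
n = len(df)
nulls_per_col = mask.sum().values                    # NaN count per column
common_obs = pd.DataFrame(
    n - nulls_per_col[:, None] - nulls_per_col[None, :] + results.values,
    index=df.columns, columns=df.columns,
)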
You can actually make @Jeff's answer a little faster by only iterating up to (but not including) i + 1 in the inner loop, and because the result is symmetric you can assign both cells at the same time. You can also move the lookup of column i's mask outside of the inner loop, which is a tiny optimization but might yield some performance gains for very large frames.
l = len(df.columns)
results = np.zeros((l, l))
mask = pd.isnull(df)
for i in range(l):
    maski = mask.iloc[:, i]          # hoist the i-th column's mask
    for j in range(i + 1):
        # symmetric, so fill (i, j) and (j, i) in one pass
        results[i, j] = results[j, i] = (maski & mask.iloc[:, j]).sum()
results = pd.DataFrame(results, index=df.columns, columns=df.columns)
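As a quick, self-contained sanity check on entirely made-up data, the loop's NaN co-counts plus the same inclusion-exclusion conversion reproduce the dot-product counts from the question:

import numpy as np
import pandas as pd

# Toy frame: 3 individuals observed over 5 periods, with gaps.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, np.nan],
    "b": [np.nan, 2.0, 3.0, np.nan, 5.0],
    "c": [1.0, 2.0, np.nan, 4.0, 5.0],
})

l = len(toy.columns)
results = np.zeros((l, l))
mask = pd.isnull(toy)
for i in range(l):
    maski = mask.iloc[:, i]
    for j in range(i + 1):
        results[i, j] = results[j, i] = (maski & mask.iloc[:, j]).sum()

# Convert "both NaN" counts to "both observed" counts and compare with the
# dot-product approach from the question.
nulls_per_col = mask.sum().values
common_obs = len(toy) - nulls_per_col[:, None] - nulls_per_col[None, :] + results

not_null = pd.notnull(toy).values.astype(int)
assert np.array_equal(common_obs, not_null.T.dot(not_null))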