
pandas corr and corrwith very slow

I have a pandas DataFrame with fewer than 30K rows and 7 columns, and I'm trying to get the correlation of four of the columns (A, B, C, D) with a fifth one (perform). The problem is that I'd like to do this with massive datasets, but it already takes ~40s to run. Here is my code:

df_a = dfr[['id', 'state', 'perform', 'A']].groupby(['id', 'state']).corr().iloc[1::2][['A']].reset_index(2).drop('level_2', axis=1)
df_b = dfr[['id', 'state', 'perform', 'B']].groupby(['id', 'state']).corr().iloc[1::2][['B']].reset_index(2).drop('level_2', axis=1)
df_c = dfr[['id', 'state', 'perform', 'C']].groupby(['id', 'state']).corr().iloc[1::2][['C']].reset_index(2).drop('level_2', axis=1)
df_d = dfr[['id', 'state', 'perform', 'D']].groupby(['id', 'state']).corr().iloc[1::2][['D']].reset_index(2).drop('level_2', axis=1)

df = df_a.merge(df_b, left_index=True, right_index=True)
df = df.merge(df_c, left_index=True, right_index=True)
df = df.merge(df_d, left_index=True, right_index=True)

Sample data looks as follows:

ID   State   perform   A   B   C   D
234   AK     75.8456   1   0   0   0
284   MN     78.6752   0   0   1   0

Does anyone have any tips on how I could make this faster, or implement this method better?

Thank you!

TMarks asked Jan 15 '18 21:01




1 Answer

pandas corr is slow because it has to account for NaNs: it is basically a Cython for-loop over column pairs that skips missing values pair by pair.

If your data doesn't have NaNs, numpy.corrcoef is much faster.
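As a rough sketch of what that could look like for the question's data: the function below calls numpy.corrcoef once per (id, state) group and keeps only the correlations of perform with A, B, C and D. The name group_corr_with_perform and the demo data are made up for illustration, and the lowercase column names id and state are taken from the question's code. It assumes no NaNs and at least two rows with some variation per group; otherwise numpy.corrcoef returns NaN.

```python
import numpy as np
import pandas as pd


def group_corr_with_perform(dfr, target="perform", cols=("A", "B", "C", "D")):
    """Correlate `target` with each column in `cols` within every (id, state) group.

    Hypothetical helper: assumes the selected columns contain no NaNs.
    """
    keys, rows = [], []
    for key, group in dfr.groupby(["id", "state"]):
        # One 2-D float array per group; the first column is the target.
        values = group[[target, *cols]].to_numpy(dtype=float)
        # rowvar=False: each column is a variable, each row an observation.
        corr = np.corrcoef(values, rowvar=False)
        keys.append(key)
        # Row 0 holds the target's correlations; drop its self-correlation.
        rows.append(corr[0, 1:])
    index = pd.MultiIndex.from_tuples(keys, names=["id", "state"])
    return pd.DataFrame(np.vstack(rows), index=index, columns=list(cols))


# Made-up demo data shaped like the sample in the question.
rng = np.random.default_rng(0)
dfr_demo = pd.DataFrame({
    "id": np.repeat([234, 284], 50),
    "state": np.repeat(["AK", "MN"], 50),
    "perform": rng.normal(75.0, 5.0, size=100),
    "A": rng.integers(0, 2, size=100),
    "B": rng.integers(0, 2, size=100),
    "C": rng.integers(0, 2, size=100),
    "D": rng.integers(0, 2, size=100),
})
result = group_corr_with_perform(dfr_demo)
```

The result has one row per (id, state) pair and one column per correlated variable, so the four separate corr() calls and the three merges are not needed. Whether this is actually faster on your data is worth timing yourself.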

Ma Ming answered Oct 22 '22 22:10