I have a pandas DataFrame with <30K rows and 7 columns, and I'm trying to get the correlation of 4 of the columns to a fifth one. The problem is that I'd like to do this with massive datasets, but this already takes ~40s to run. Here is my code:
df_a = dfr[['id', 'state', 'perform', 'A']].groupby(['id', 'state']).corr().iloc[1::2][['A']].reset_index(2).drop('level_2', axis=1)
df_b = dfr[['id', 'state', 'perform', 'B']].groupby(['id', 'state']).corr().iloc[1::2][['B']].reset_index(2).drop('level_2', axis=1)
df_c = dfr[['id', 'state', 'perform', 'C']].groupby(['id', 'state']).corr().iloc[1::2][['C']].reset_index(2).drop('level_2', axis=1)
df_d = dfr[['id', 'state', 'perform', 'D']].groupby(['id', 'state']).corr().iloc[1::2][['D']].reset_index(2).drop('level_2', axis=1)
df = df_a.merge(df_b, left_index=True, right_index=True)
df = df.merge(df_c, left_index=True, right_index=True)
df = df.merge(df_d, left_index=True, right_index=True)
Sample data looks as follows:
ID   State  perform  A  B  C  D
234  AK     75.8456  1  0  0  0
284  MN     78.6752  0  0  1  0
Does anyone have any tips on how I could make this faster, or implement this method better?
Thank you!
Pandas keeps track of data types and indexes and performs error checking, all of which is very useful but also slows down calculations. NumPy does none of that, so it can perform the same calculations significantly faster.
corr() finds the pairwise correlation of all columns in a pandas DataFrame. Any NaN values are automatically excluded, and any non-numeric columns are ignored.
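For example, with a tiny made-up frame (not the question's data), assuming a reasonably recent pandas where numeric_only makes the column filtering explicit:

import pandas as pd

# Tiny made-up frame: one NaN in 'perform' and a non-numeric 'state' column.
df = pd.DataFrame({'state': ['AK', 'MN', 'MN'],
                   'perform': [75.8, 78.7, None],
                   'A': [1, 0, 1]})

# corr() returns the full pairwise matrix; the NaN row is dropped pairwise
# and 'state' is left out of the result.
print(df.corr(numeric_only=True))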
The reason pandas' corr is so slow is that it accounts for NaNs: it is basically a Cython for-loop.
If your data doesn't have NaNs, numpy.corrcoef is much faster.
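As a rough sketch of what that could look like for the question's dfr (assuming it has no NaNs and keeps the column names shown above; perform_corr is just an illustrative helper name):

import numpy as np
import pandas as pd

def perform_corr(group, cols=('A', 'B', 'C', 'D')):
    # np.corrcoef with rowvar=False treats each column as a variable; row 0 of
    # the resulting matrix holds the correlations of 'perform' with each other column.
    mat = np.corrcoef(group[['perform', *cols]].to_numpy(), rowvar=False)
    return pd.Series(mat[0, 1:], index=list(cols))

# One row per (id, state) group, with columns A-D holding the correlations to 'perform'.
df = dfr.groupby(['id', 'state']).apply(perform_corr)

Note that np.corrcoef returns NaN (with a runtime warning) for any group where a column is constant, such as an indicator column that is all zeros, so this is only a drop-in replacement when the data is clean.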