I have a pandas DataFrame with <30K rows and 7 columns, and I'm trying to get the correlation of 4 of the columns to a fifth one. The problem is that I'd like to do this with massive datasets, but this already takes ~40s to run. Here is my code:
df_a = dfr[['id', 'state', 'perform', 'A']].groupby(['id', 'state']).corr().iloc[1::2][['A']].reset_index(2).drop('level_2', axis=1)
df_b = dfr[['id', 'state', 'perform', 'B']].groupby(['id', 'state']).corr().iloc[1::2][['B']].reset_index(2).drop('level_2', axis=1)
df_c = dfr[['id', 'state', 'perform', 'C']].groupby(['id', 'state']).corr().iloc[1::2][['C']].reset_index(2).drop('level_2', axis=1)
df_d = dfr[['id', 'state', 'perform', 'D']].groupby(['id', 'state']).corr().iloc[1::2][['D']].reset_index(2).drop('level_2', axis=1)
df = df_a.merge(df_b, left_index=True, right_index=True)
df = df.merge(df_c, left_index=True, right_index=True)
df = df.merge(df_d, left_index=True, right_index=True)
Sample data looks as follows:
ID   State  perform  A  B  C  D
234  AK     75.8456  1  0  0  0
284  MN     78.6752  0  0  1  0
Does anyone have any tips on how I could make this faster, or implement this method better?
Thank you!
Pandas keeps track of data types and indexes and performs error checking, all of which is very useful but also slows down calculations. NumPy does none of that, so it can perform the same calculations significantly faster.
corr() finds the pairwise correlation of all columns in a pandas DataFrame. Any NaN values are automatically excluded, and any non-numeric columns are ignored.
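For example, with a tiny made-up frame (not the question's data), assuming a reasonably recent pandas where numeric_only makes the column filtering explicit:

import pandas as pd

# Tiny made-up frame: one NaN in 'perform' and a non-numeric 'state' column.
df = pd.DataFrame({'state': ['AK', 'MN', 'MN'],
                   'perform': [75.8, 78.7, None],
                   'A': [1, 0, 1]})

# corr() returns the full pairwise matrix; the NaN row is dropped pairwise
# and 'state' is left out of the result.
print(df.corr(numeric_only=True))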
The reason pandas' corr is so slow is that it accounts for NaNs: it is basically a Cython for-loop.
If your data doesn't have NaNs, numpy.corrcoef is much faster.
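As a rough sketch of what that could look like for the question's dfr (assuming it has no NaNs and keeps the column names shown above; perform_corr is just an illustrative helper name):

import numpy as np
import pandas as pd

def perform_corr(group, cols=('A', 'B', 'C', 'D')):
    # np.corrcoef with rowvar=False treats each column as a variable; row 0 of
    # the resulting matrix holds the correlations of 'perform' with each other column.
    mat = np.corrcoef(group[['perform', *cols]].to_numpy(), rowvar=False)
    return pd.Series(mat[0, 1:], index=list(cols))

# One row per (id, state) group, with columns A-D holding the correlations to 'perform'.
df = dfr.groupby(['id', 'state']).apply(perform_corr)

Note that np.corrcoef returns NaN (with a runtime warning) for any group where a column is constant, such as an indicator column that is all zeros, so this is only a drop-in replacement when the data is clean.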