
Computing correlation matrix faster in Pandas

Tags: python, pandas

I've identified the following operation on a given Pandas DataFrame df as the bottleneck of my code:

df.corr()

Are there any drop-in replacements that would speed this step up?

Thank you!

sdgaw erzswer asked Apr 24 '26 02:04



1 Answer

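You might try numpy.corrcoef instead. Pass the column labels as both index and columns so the result is labelled the same way as the output of df.corr():

pd.DataFrame(np.corrcoef(df.values, rowvar=False), index=df.columns, columns=df.columns)

Note that np.corrcoef does not skip missing values the way df.corr() does, so the two only agree when df contains no NaNs.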
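If the frame may sometimes contain missing values, the one-liner can be wrapped in a small helper. The following is only a sketch: the name fast_corr is hypothetical, it assumes every column is numeric, and it simply falls back to df.corr() when NaNs are present, since np.corrcoef would propagate them.

```python
import numpy as np
import pandas as pd


def fast_corr(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation between the columns of df (hypothetical helper).

    Assumes all columns are numeric.
    """
    if df.isna().to_numpy().any():
        # np.corrcoef propagates NaN, whereas DataFrame.corr excludes
        # missing values pair by pair, so defer to pandas in that case.
        return df.corr()
    # rowvar=False: each column is a variable, each row an observation.
    values = np.atleast_2d(np.corrcoef(df.to_numpy(dtype=float), rowvar=False))
    return pd.DataFrame(values, index=df.columns, columns=df.columns)
```

Calling fast_corr(df) in place of df.corr() should then give a frame with the same labels, taking the NumPy path only when it is safe to do so.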

Example Timings

# Setup
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 1000))

df.corr()
# 15 s ± 225 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pd.DataFrame(np.corrcoef(df.values, rowvar=False), index=df.columns, columns=df.columns)
# 24.4 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
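
Before swapping one call for the other, it may be worth confirming on your own data that both give the same numbers and that the speed-up holds on your machine. A minimal sketch using the standard timeit module outside IPython follows; the frame size and repeat count are arbitrary choices, not taken from the timings above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((200, 50)))  # no missing values

via_pandas = df.corr()
via_numpy = pd.DataFrame(
    np.corrcoef(df.to_numpy(), rowvar=False),
    index=df.columns,
    columns=df.columns,
)

# Both approaches compute Pearson correlation, so they should agree
# up to floating-point error when there are no NaNs.
same = np.allclose(via_pandas.to_numpy(), via_numpy.to_numpy())
print("results agree:", same)

# Total wall-clock time for 5 calls of each variant.
t_pandas = timeit.timeit(df.corr, number=5)
t_numpy = timeit.timeit(lambda: np.corrcoef(df.to_numpy(), rowvar=False), number=5)
print(f"pandas: {t_pandas:.4f} s, numpy: {t_numpy:.4f} s (5 calls each)")
```

The actual ratio depends on the pandas and NumPy versions and on the shape of the frame, so the figures printed here need not match those quoted above.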
Chris Adams answered Apr 25 '26 16:04



