
Correlation between columns in DataFrame

Tags:

python

pandas

I'm pretty new to pandas, so I guess I'm doing something wrong -

I have a DataFrame:

         a     b
    0  0.5  0.75
    1  0.5  0.75
    2  0.5  0.75
    3  0.5  0.75
    4  0.5  0.75

df.corr() gives me:

        a   b
    a NaN NaN
    b NaN NaN

but np.correlate(df["a"], df["b"]) gives: 1.875

Why is that? I want to have the correlation matrix for my DataFrame and thought that corr() does that (at least according to the documentation). Why does it return NaN?

What's the correct way to calculate?

Many thanks!

581

asked Apr 06 '13 19:04

Zach Moshe

1 Answer

np.correlate calculates the (unnormalized) cross-correlation between two 1-dimensional sequences:

z[k] = sum_n a[n] * conj(v[n+k])

while df.corr (by default) calculates the Pearson correlation coefficient.

The correlation coefficient (if it exists) is always between -1 and 1 inclusive. The cross-correlation is not bounded.

The formulas are related, but note that the cross-correlation formula above neither subtracts the means nor divides by the standard deviations, both of which are part of the formula for the Pearson correlation coefficient.
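To make the difference concrete, here is a minimal sketch that computes the Pearson coefficient by hand and compares it with df.corr and np.correlate. The sample values are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Made-up sample data, chosen only for illustration.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "b": [2.0, 1.0, 4.0, 3.0, 6.0]})

a = df["a"].to_numpy()
b = df["b"].to_numpy()

# Pearson: subtract the means, then divide by the product of the
# standard deviations (the 1/n factors cancel, so plain sums suffice).
da = a - a.mean()
db = b - b.mean()
pearson = (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

# Cross-correlation with the default mode: just the sum of products,
# with no centering and no normalization.
cross = np.correlate(a, b)[0]

print(pearson)                   # hand-computed Pearson coefficient
print(df.corr().loc["a", "b"])   # should agree with the line above
print(cross)                     # 58.0, well outside [-1, 1]
```

The hand-computed value matches df.corr and stays within [-1, 1], while the cross-correlation is simply the raw sum of products.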

Both df['a'] and df['b'] are constant, so their standard deviations are zero. The Pearson formula then divides zero by zero, which is why df.corr returns NaN everywhere.
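The numbers from the question can be reproduced directly. This is a minimal sketch using the constant columns from the question, assuming reasonably recent pandas and NumPy versions:

```python
import numpy as np
import pandas as pd

# The constant columns from the question.
df = pd.DataFrame({"a": [0.5] * 5, "b": [0.75] * 5})

# Both standard deviations are zero, so Pearson's formula is 0 / 0.
stds = df.std()
print(stds)

# Hence every entry of the correlation matrix is NaN.
corr = df.corr()
print(corr)

# np.correlate only sums the products: 5 * (0.5 * 0.75) = 1.875.
cross = np.correlate(df["a"], df["b"])[0]
print(cross)
```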

From the comment below, it sounds like you are looking for Beta. It is related to Pearson's correlation coefficient, but instead of dividing by the product of standard deviations:

corr(a, b) = cov(a, b) / (std(a) * std(b))

you divide by a variance:

beta = cov(a, b) / var(a)

You can compute Beta using np.cov:

    cov = np.cov(a, b)
    beta = cov[1, 0] / cov[0, 0]
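As a quick sanity check of the np.cov recipe above, here is a small sketch on synthetic data where the answer is known in advance. The data (b = 3 * a + 1) and the seed are assumptions made for illustration. It also shows that Beta is the same quantity as the least-squares slope of b on a.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
a = rng.normal(size=1000)
b = 3.0 * a + 1.0                # exact linear relationship with slope 3

# 2x2 matrix: [[var(a), cov(a, b)], [cov(a, b), var(b)]]
cov = np.cov(a, b)
beta = cov[1, 0] / cov[0, 0]     # cov(a, b) / var(a)

# The least-squares slope of b regressed on a is the same quantity.
slope = np.polyfit(a, b, 1)[0]

print(beta, slope)               # both approximately 3
```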

    import numpy as np
    import matplotlib.pyplot as plt

    np.random.seed(100)


    def geometric_brownian_motion(T=1, N=100, mu=0.1, sigma=0.01, S0=20):
        """
        http://stackoverflow.com/a/13203189/190597 (unutbu)
        """
        dt = float(T) / N
        t = np.linspace(0, T, N)
        W = np.random.standard_normal(size=N)
        W = np.cumsum(W) * np.sqrt(dt)  # standard brownian motion ###
        X = (mu - 0.5 * sigma ** 2) * t + sigma * W
        S = S0 * np.exp(X)  # geometric brownian motion ###
        return S

    N = 10 ** 6
    a = geometric_brownian_motion(T=1, mu=0.1, sigma=0.01, N=N)
    b = geometric_brownian_motion(T=1, mu=0.2, sigma=0.01, N=N)

    cov = np.cov(a, b)
    print(cov)
    # [[ 0.38234755  0.80525967]
    #  [ 0.80525967  1.73517501]]
    beta = cov[1, 0] / cov[0, 0]
    print(beta)
    # 2.10609347015

    plt.plot(a)
    plt.plot(b)
    plt.show()

The ratio of mus is 2, and beta is ~2.1.

You could also compute it with df.corr, though this is a much more roundabout way of doing it (but it is nice to see there is consistency):

    import pandas as pd

    df = pd.DataFrame({'a': a, 'b': b})
    # corr * std(b) * std(a) recovers the covariance; dividing by var(a) gives beta
    beta2 = (df.corr() * df['b'].std() * df['a'].std() / df['a'].var()).iloc[0, 1]
    print(beta2)
    # 2.10609347015
    assert np.allclose(beta, beta2)
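If the data already lives in a DataFrame, df.cov gives a more direct route than going through df.corr. The following is a sketch on small synthetic data (the columns and seed are made up for illustration), checking that the two routes agree:

```python
import numpy as np
import pandas as pd

# Small made-up data set; any two non-constant columns would do.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = 2.0 * a + rng.normal(scale=0.1, size=500)
df = pd.DataFrame({"a": a, "b": b})

# Beta via the covariance matrix: cov(a, b) / var(a).
# df.cov and Series.var both default to ddof=1, so they are consistent.
beta_cov = df.cov().loc["a", "b"] / df["a"].var()

# Beta via the correlation matrix: corr(a, b) * std(b) / std(a).
beta_corr = df.corr().loc["a", "b"] * df["b"].std() / df["a"].std()

print(beta_cov, beta_corr)       # both close to the true slope of 2
```

Using label-based .loc["a", "b"] rather than positional indexing keeps the lookup correct even if the column order changes.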

112

answered Sep 18 '22 15:09

unutbu
