I'm pretty new to pandas, so I guess I'm doing something wrong -
I have a DataFrame:
a b 0 0.5 0.75 1 0.5 0.75 2 0.5 0.75 3 0.5 0.75 4 0.5 0.75
df.corr()
gives me:
a b a NaN NaN b NaN NaN
but np.correlate(df["a"], df["b"])
gives: 1.875
Why is that? I want to have the correlation matrix for my DataFrame and thought that corr()
does that (at least according to the documentation). Why does it return NaN
?
What's the correct way to calculate?
Many thanks!
Print the input DataFrame, df. Initialize two variables, col1 and col2, and assign them the columns that you want to find the correlation of. Find the correlation between col1 and col2 by using df[col1]. corr(df[col2]) and save the correlation value in a variable, corr.
It is denoted by r and values between -1 and +1. A positive value for r indicates a positive association, and a negative value for r indicates a negative association. By using corr() function we can get the correlation between two columns in the dataframe.
The corr() aggregate function returns a coefficient of correlation between two numbers.
np.correlate calculates the (unnormalized) cross-correlation between two 1-dimensional sequences:
z[k] = sum_n a[n] * conj(v[n+k])
while df.corr (by default) calculates the Pearson correlation coefficient.
The correlation coefficient (if it exists) is always between -1 and 1 inclusive. The cross-correlation is not bounded.
The formulas are somewhat related, but notice that in the cross-correlation formula (above) there is no subtraction of the means, and no division by the standard deviations which is part of the formula for Pearson correlation coefficient.
The fact that the standard deviation of df['a']
and df['b']
is zero is what causes df.corr
to be NaN everywhere.
From the comment below, it sounds like you are looking for Beta. It is related to Pearson's correlation coefficient, but instead of dividing by the product of standard deviations:
you divide by a variance:
You can compute Beta
using np.cov
cov = np.cov(a, b) beta = cov[1, 0] / cov[0, 0]
import numpy as np import matplotlib.pyplot as plt np.random.seed(100) def geometric_brownian_motion(T=1, N=100, mu=0.1, sigma=0.01, S0=20): """ http://stackoverflow.com/a/13203189/190597 (unutbu) """ dt = float(T) / N t = np.linspace(0, T, N) W = np.random.standard_normal(size=N) W = np.cumsum(W) * np.sqrt(dt) # standard brownian motion ### X = (mu - 0.5 * sigma ** 2) * t + sigma * W S = S0 * np.exp(X) # geometric brownian motion ### return S N = 10 ** 6 a = geometric_brownian_motion(T=1, mu=0.1, sigma=0.01, N=N) b = geometric_brownian_motion(T=1, mu=0.2, sigma=0.01, N=N) cov = np.cov(a, b) print(cov) # [[ 0.38234755 0.80525967] # [ 0.80525967 1.73517501]] beta = cov[1, 0] / cov[0, 0] print(beta) # 2.10609347015 plt.plot(a) plt.plot(b) plt.show()
The ratio of mu
s is 2, and beta
is ~2.1.
And you could also compute it with df.corr
, though this is a much more round-about way of doing it (but it is nice to see there is consistency):
import pandas as pd df = pd.DataFrame({'a': a, 'b': b}) beta2 = (df.corr() * df['b'].std() * df['a'].std() / df['a'].var()).ix[0, 1] print(beta2) # 2.10609347015 assert np.allclose(beta, beta2)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With