Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Correlation between columns in DataFrame

Tags:

python

pandas

I'm pretty new to pandas, so I guess I'm doing something wrong -

I have a DataFrame:

     a     b 0  0.5  0.75 1  0.5  0.75 2  0.5  0.75 3  0.5  0.75 4  0.5  0.75 

df.corr() gives me:

    a   b a NaN NaN b NaN NaN 

but np.correlate(df["a"], df["b"]) gives: 1.875

Why is that? I want to have the correlation matrix for my DataFrame and thought that corr() does that (at least according to the documentation). Why does it return NaN?

What's the correct way to calculate?

Many thanks!

like image 581
Zach Moshe Avatar asked Apr 06 '13 19:04

Zach Moshe


People also ask

How do you find the correlation between columns in a data frame?

Print the input DataFrame, df. Initialize two variables, col1 and col2, and assign them the columns that you want to find the correlation of. Find the correlation between col1 and col2 by using df[col1]. corr(df[col2]) and save the correlation value in a variable, corr.

What is correlation between columns pandas?

It is denoted by r and values between -1 and +1. A positive value for r indicates a positive association, and a negative value for r indicates a negative association. By using corr() function we can get the correlation between two columns in the dataframe.

What does Corr () return?

The corr() aggregate function returns a coefficient of correlation between two numbers.


1 Answers

np.correlate calculates the (unnormalized) cross-correlation between two 1-dimensional sequences:

z[k] = sum_n a[n] * conj(v[n+k]) 

while df.corr (by default) calculates the Pearson correlation coefficient.

The correlation coefficient (if it exists) is always between -1 and 1 inclusive. The cross-correlation is not bounded.

The formulas are somewhat related, but notice that in the cross-correlation formula (above) there is no subtraction of the means, and no division by the standard deviations which is part of the formula for Pearson correlation coefficient.

The fact that the standard deviation of df['a'] and df['b'] is zero is what causes df.corr to be NaN everywhere.


From the comment below, it sounds like you are looking for Beta. It is related to Pearson's correlation coefficient, but instead of dividing by the product of standard deviations:

enter image description here

you divide by a variance:

enter image description here


You can compute Beta using np.cov

cov = np.cov(a, b) beta = cov[1, 0] / cov[0, 0] 

import numpy as np import matplotlib.pyplot as plt np.random.seed(100)   def geometric_brownian_motion(T=1, N=100, mu=0.1, sigma=0.01, S0=20):     """     http://stackoverflow.com/a/13203189/190597 (unutbu)     """     dt = float(T) / N     t = np.linspace(0, T, N)     W = np.random.standard_normal(size=N)     W = np.cumsum(W) * np.sqrt(dt)  # standard brownian motion ###     X = (mu - 0.5 * sigma ** 2) * t + sigma * W     S = S0 * np.exp(X)  # geometric brownian motion ###     return S  N = 10 ** 6 a = geometric_brownian_motion(T=1, mu=0.1, sigma=0.01, N=N) b = geometric_brownian_motion(T=1, mu=0.2, sigma=0.01, N=N)  cov = np.cov(a, b) print(cov) # [[ 0.38234755  0.80525967] #  [ 0.80525967  1.73517501]] beta = cov[1, 0] / cov[0, 0] print(beta) # 2.10609347015  plt.plot(a) plt.plot(b) plt.show() 

enter image description here

The ratio of mus is 2, and beta is ~2.1.


And you could also compute it with df.corr, though this is a much more round-about way of doing it (but it is nice to see there is consistency):

import pandas as pd df = pd.DataFrame({'a': a, 'b': b}) beta2 = (df.corr() * df['b'].std() * df['a'].std() / df['a'].var()).ix[0, 1] print(beta2) # 2.10609347015 assert np.allclose(beta, beta2) 
like image 112
unutbu Avatar answered Sep 18 '22 15:09

unutbu