
Weighted correlation coefficient with pandas

Is there any way to compute a weighted correlation coefficient with pandas? I saw that R has such a method. I'd also like to get the p-value of the correlation, which I could not find in R either. Wikipedia's explanation of weighted correlation: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Weighted_correlation_coefficient

Yehuda Karlinsky asked Jul 28 '16 16:07




1 Answer

I don't know of any Python packages that implement this, but it should be fairly straightforward to roll your own implementation. Using the naming conventions of the Wikipedia article:

import numpy as np

def m(x, w):
    """Weighted Mean"""
    return np.sum(x * w) / np.sum(w)

def cov(x, y, w):
    """Weighted Covariance"""
    return np.sum(w * (x - m(x, w)) * (y - m(y, w))) / np.sum(w)

def corr(x, y, w):
    """Weighted Correlation"""
    # cov(x, x, w) and cov(y, y, w) are the weighted variances of x and y.
    return cov(x, y, w) / np.sqrt(cov(x, x, w) * cov(y, y, w))

I tried to make the functions above match the formulas in the Wikipedia article as closely as possible, but there are some potential simplifications and performance improvements. For example, as pointed out by @Alberto Garcia-Raboso, m(x, w) is really just np.average(x, weights=w), so there's no need to actually write a function for it.
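As a possible further simplification, NumPy's np.cov accepts an aweights argument, so the whole calculation can be sketched in one function. This is only a sketch: the name weighted_corr is made up for illustration, and ddof=0 is chosen so that the covariance is divided by the sum of the weights, as in cov() above.

```python
import numpy as np

def weighted_corr(x, y, w):
    """Weighted Pearson correlation built on np.cov (illustrative name)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    # 2x2 weighted covariance matrix; ddof=0 normalizes by sum(w).
    c = np.cov(x, y, aweights=w, ddof=0)
    # Off-diagonal is cov(x, y); the diagonal holds the weighted variances.
    return c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])
```

The normalization constant cancels in the ratio, so the choice of ddof should not change the correlation itself.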

The functions are pretty bare-bones and only do the calculations. You may want to convert the inputs to arrays first, e.g. x = np.asarray(x), because plain Python lists do not support the element-wise arithmetic these functions rely on. Additional checks that all inputs have equal length, contain no null values, etc. could also be added.
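A minimal sketch of such checks might look like the following. The helper name prepare_inputs is hypothetical, and the rule that weights must be non-negative and not all zero is my own assumption rather than something the formulas strictly require.

```python
import numpy as np

def prepare_inputs(x, y, w):
    """Hypothetical helper: coerce inputs to float arrays and validate them."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    if x.ndim != 1 or not (x.shape == y.shape == w.shape):
        raise ValueError("x, y and w must be 1-D and of equal length")
    if np.isnan(x).any() or np.isnan(y).any() or np.isnan(w).any():
        raise ValueError("inputs must not contain NaN")
    # Assumption: negative or all-zero weights are treated as invalid.
    if (w < 0).any() or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    return x, y, w
```

Calling x, y, w = prepare_inputs(x, y, w) before corr(x, y, w) would then let lists and pandas Series be passed interchangeably.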

Example usage:

import numpy as np
import pandas as pd

# Initialize a DataFrame with random x, y and weight columns.
np.random.seed([3,1415])
n = 10**6
df = pd.DataFrame({
    'x': np.random.choice(3, size=n),
    'y': np.random.choice(4, size=n),
    'w': np.random.random(size=n)
    })

# Compute the weighted correlation using corr() defined above.
r = corr(df['x'], df['y'], df['w'])
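One way to sanity-check an implementation like this is to compare it with the unweighted Pearson coefficient when all weights are equal, and to confirm that rescaling the weights leaves the result unchanged. The sketch below restates corr() in a compact form so that it runs on its own; the test data is synthetic and chosen only for illustration.

```python
import numpy as np

def corr(x, y, w):
    """Weighted correlation, same formula as above in compact form."""
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    cov_xy = np.sum(w * (x - mx) * (y - my))
    var_x = np.sum(w * (x - mx) ** 2)
    var_y = np.sum(w * (y - my) ** 2)
    # The 1 / sum(w) factors cancel between numerator and denominator.
    return cov_xy / np.sqrt(var_x * var_y)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)
w = rng.random(size=1000)

# Equal weights should reproduce the ordinary Pearson r.
r_equal = corr(x, y, np.ones_like(x))
r_plain = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_equal, r_plain))

# Multiplying all weights by a constant should not change the result.
print(np.isclose(corr(x, y, w), corr(x, y, 3.0 * w)))
```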

There's a discussion here regarding the p-value. There doesn't appear to be a generic calculation; it depends on how the weights were actually obtained.
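If no analytic formula fits your weights, one assumption-heavy fallback is a permutation test. The sketch below is not taken from the discussion mentioned above; it assumes that, under the null hypothesis, y is independent of the (x, w) pairs, so shuffling y alone is legitimate. Whether that holds depends on where your weights come from. The names weighted_corr and permutation_pvalue are made up for illustration.

```python
import numpy as np

def weighted_corr(x, y, w):
    """Weighted Pearson correlation (illustrative name)."""
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    num = np.sum(w * (x - mx) * (y - my))
    den = np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))
    return num / den

def permutation_pvalue(x, y, w, n_perm=999, seed=0):
    """Two-sided permutation p-value; assumes y is exchangeable under the null."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    rng = np.random.default_rng(seed)
    observed = abs(weighted_corr(x, y, w))
    hits = 0
    for _ in range(n_perm):
        # Shuffle y only, keeping each x paired with its weight.
        if abs(weighted_corr(x, rng.permutation(y), w)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Each permutation recomputes the full correlation, so for the million-row example above you would likely want a much smaller n_perm or a subsample.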

root answered Sep 21 '22 12:09
