Is there any way to compute weighted correlation coefficient with pandas? I saw that R has such a method. Also, I'd like to get the p value of the correlation. This I did not find also in R. Link to Wikipedia for explanation about weighted correlation: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Weighted_correlation_coefficient
Pandas makes it very easy to find the correlation coefficient! We can simply call the . corr() method on the dataframe of interest. The method returns a correlation matrix that shows the coefficient of correlation between different variables.
A weighted correlation allows you to apply a weight, or relative significance to each value comparison. Correlation comparisons with a higher value for their weight are considered as more significant when compared to the other value comparisons.
Initialize two variables, col1 and col2, and assign them the columns that you want to find the correlation of. Find the correlation between col1 and col2 by using df[col1]. corr(df[col2]) and save the correlation value in a variable, corr. Print the correlation value, corr.
I don't know of any Python packages that implement this, but it should be fairly straightforward to roll your own implementation. Using the naming conventions of the wikipedia article:
def m(x, w):
"""Weighted Mean"""
return np.sum(x * w) / np.sum(w)
def cov(x, y, w):
"""Weighted Covariance"""
return np.sum(w * (x - m(x, w)) * (y - m(y, w))) / np.sum(w)
def corr(x, y, w):
"""Weighted Correlation"""
return cov(x, y, w) / np.sqrt(cov(x, x, w) * cov(y, y, w))
I tried to make the functions above match the formulas in the wikipedia as closely as possible, but there are some potential simplifications and performance improvements. For example, as pointed out by @Alberto Garcia-Raboso, m(x, w)
is really just np.average(x, weights=w)
, so there's no need to actually write a function for it.
The functions are pretty bare-bones, just doing the calculations. You may want to consider forcing inputs to be arrays prior to doing the calculations, i.e. x = np.asarray(x)
, as these functions will not work if lists are passed. Additional checks to verify all inputs have equal length, non-null values, etc. could also be implemented.
Example usage:
# Initialize a DataFrame.
np.random.seed([3,1415])
n = 10**6
df = pd.DataFrame({
'x': np.random.choice(3, size=n),
'y': np.random.choice(4, size=n),
'w': np.random.random(size=n)
})
# Compute the correlation.
r = corr(df['x'], df['y'], df['w'])
There's a discussion here regarding the p-value. It doesn't look like there's a generic calculation, and it depends on how you're actually getting the weights.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With