
How to adjust the coefficients of an equation to obtain high correlation between y and x_i?

Tags:

math

Given a set of variables x_1, ..., x_n, I want to find values of the coefficients in this equation:

y = a_1*x_1 + ... + a_n*x_n + c

where a_1, a_2, ..., a_n and c are all unknown. Thinking of this in terms of a data frame, I want to compute this value of y for every row in the data.

My question is: given that y, a_1, ..., a_n and c are all unknown, is there a way to find a set of solutions a_1, ..., a_n such that corr(y, x_1), corr(y, x_2), ..., corr(y, x_n) are all greater than 0.7? For simplicity, take correlation here to mean Pearson correlation. I know there will not be a unique solution, but how can I construct a set of solutions for a_1, ..., a_n that fulfils this condition?

I spent a day searching for ideas but could not find anything useful. A solution in any programming language is welcome, or at least a reference.

skw1990 asked Sep 25 '22 11:09


1 Answer

No, it is not possible in general. It may be possible in some special cases.

Given x₁, x₂, ... you want to find y = a₁x₁ + a₂x₂ + ... + c so that all the correlations between y and the x's are greater than some target R. Since the correlation is

Corr(y, xi) = Cov(y, xi) / Sqrt[ Var(y) * Var(xi) ]

your constraint is

Cov(y, xi) / Sqrt[ Var(y) * Var(xi) ] > R

which, for a positive target R (so that Cov(y, xi) must itself be positive), can be rearranged to

Cov(y, xi)² > R² * Var(y) * Var(xi)

and this needs to be true for all i.
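To make the constraint concrete, here is a minimal sketch in Python with NumPy that checks it for a given set of coefficients. The function names `correlations_with_y` and `meets_target` are made up for this illustration, and `X` is assumed to be a numeric array with one column per x.

```python
import numpy as np

def correlations_with_y(X, a, c=0.0):
    """Pearson correlation between y = X @ a + c and each column of X."""
    X = np.asarray(X, dtype=float)
    a = np.asarray(a, dtype=float)
    y = X @ a + c
    Xc = X - X.mean(axis=0)          # centred columns
    yc = y - y.mean()                # centred y
    cov = Xc.T @ yc / len(y)         # Cov(y, xi) for every column i
    return cov / np.sqrt(yc.var() * Xc.var(axis=0))

def meets_target(X, a, R, c=0.0):
    """True if every Corr(y, xi) is strictly greater than R."""
    return bool(np.all(correlations_with_y(X, a, c) > R))
```

Note that the constant c has no effect on the result, since it drops out of both the covariances and the variances.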

Consider the simple case where there are only two columns x₁ and x₂, and further assume that they both have mean zero (so you can ignore the constant c) and variance 1, and that they are uncorrelated. In that case y = a₁x₁ + a₂x₂ and the covariances and variances are

Cov(y, x₁) = a₁
Cov(y, x₂) = a₂
Var(x₁)    = 1
Var(x₂)    = 1
Var(y)     = (a₁)² + (a₂)²

so you need to simultaneously satisfy

(a₁)² > R² * ((a₁)² + (a₂)²)
(a₂)² > R² * ((a₁)² + (a₂)²)

Adding these inequalities together, you get

(a₁)² + (a₂)² > 2 * R² * ((a₁)² + (a₂)²)

which, after cancelling the common factor (a₁)² + (a₂)² on both sides, means that both inequalities can hold only if R < Sqrt(1/2). So the very best you can do in this simple case is to choose a₁ = a₂ > 0 (the exact value doesn't matter as long as they are equal and positive), and both of the correlations Corr(y, x₁) and Corr(y, x₂) will be equal to Sqrt(1/2) ≈ 0.707. You cannot achieve correlations higher than this between y and all of the x's simultaneously in this case.
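A quick numerical check of the two-column case, as a sketch only: the toy columns `x1` and `x2` below are made up so that they have mean zero, variance 1 and zero correlation exactly, and the helper `min_corr` is a name chosen for this example. It scans coefficient pairs (cos t, sin t) with both coefficients positive and reports the pair that maximises the smaller of the two correlations.

```python
import numpy as np

# Toy data: mean 0, variance 1 and zero covariance by construction
x1 = np.array([1.0, -1.0, 1.0, -1.0])
x2 = np.array([1.0, 1.0, -1.0, -1.0])

def min_corr(a1, a2):
    """Smaller of Corr(y, x1) and Corr(y, x2) for y = a1*x1 + a2*x2."""
    y = a1 * x1 + a2 * x2
    return min(np.corrcoef(y, x1)[0, 1], np.corrcoef(y, x2)[0, 1])

# Only the direction of (a1, a2) matters, so scan angles in the positive quadrant
angles = np.linspace(0.01, np.pi / 2 - 0.01, 1001)
scores = [min_corr(np.cos(t), np.sin(t)) for t in angles]
best = angles[int(np.argmax(scores))]
print(np.cos(best), np.sin(best), max(scores))
```

The best pair found has equal coefficients, and the smaller correlation never exceeds roughly 0.707, in line with the derivation.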

For the more general case with n columns (each of which has mean zero, variance 1 and zero correlation between columns) the same argument applies: summing the n inequalities gives 1 > n * R², so you cannot simultaneously achieve correlations greater than 1 / sqrt(n) (as pointed out in the comments by @kazemakase).
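The 1 / sqrt(n) limit can be illustrated with a small sketch. The helper `uncorrelated_columns` is hypothetical: it builds synthetic columns that are exactly mean zero, unit variance and mutually uncorrelated by orthogonalising centred random data with a QR decomposition. With equal weights, every correlation then comes out at 1 / sqrt(n).

```python
import numpy as np

def uncorrelated_columns(m, n, seed=0):
    """m-by-n array whose columns have mean 0, variance 1 and zero covariance."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((m, n))
    Z -= Z.mean(axis=0)               # centre each column
    Q, _ = np.linalg.qr(Z)            # orthonormal columns, still mean zero
    return Q * np.sqrt(m)             # rescale so each column has variance 1

def equal_weight_correlations(X):
    """Corr(y, xi) for y = x1 + ... + xn, i.e. all coefficients equal."""
    y = X.sum(axis=1)
    return np.array([np.corrcoef(y, X[:, i])[0, 1] for i in range(X.shape[1])])

for n in (2, 3, 5, 10):
    r = equal_weight_correlations(uncorrelated_columns(200, n))
    print(n, r.min(), 1 / np.sqrt(n))
```

For n = 2 this reproduces the 0.707 figure from the two-column case; by n = 3 the limit is already below 0.6, so a target of 0.7 is out of reach for uncorrelated columns.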

In general, the more independent variables there are, the lower the correlation you will be able to achieve between y and the x's. Also (although I haven't mentioned it above) the correlations between the x's matter. If they are in general positively correlated, you will be able to achieve a higher target correlation between y and the x's. If they are in general uncorrelated or negatively correlated, you will only be able to achieve low correlations between y and the x's.
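The effect of correlation between the x's can be sketched directly from a covariance matrix, without any data. The example below assumes an idealised case that is not in the original answer: every pair of x's shares the same correlation rho, and all coefficients are equal. The names `corr_from_cov` and `equicorrelated` are made up for the illustration.

```python
import numpy as np

def corr_from_cov(Sigma, a):
    """Corr(y, xi) for y = sum(a_i * x_i), given the covariance matrix of the x's."""
    Sigma = np.asarray(Sigma, dtype=float)
    a = np.asarray(a, dtype=float)
    cov_y_x = Sigma @ a               # Cov(y, xi) for each i
    var_y = a @ Sigma @ a             # Var(y)
    return cov_y_x / np.sqrt(var_y * np.diag(Sigma))

def equicorrelated(n, rho):
    """Unit variances with the same correlation rho between every pair of x's."""
    return (1 - rho) * np.eye(n) + rho * np.ones((n, n))

n = 3
for rho in (-0.3, 0.0, 0.3, 0.6):
    r = corr_from_cov(equicorrelated(n, rho), np.ones(n))
    print(rho, r.round(3))
```

In this setup each correlation equals sqrt((1 + (n - 1) * rho) / n), so it rises with rho: for three columns it is below 0.6 when rho = 0 but above 0.7 once rho reaches 0.3.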

Chris Taylor answered Sep 29 '22 06:09