Given a set of variables, x
's. I want to find the values of coefficients for this equation:
y = a_1*x_1 +... +a_n*x_n + c
where a_1,a_2,...,a_n
are all unknowns. Thinking this in perspective of data frame, I want to create this value of y
for every rows in the data.
My question is: for y, a_1...a_n
and c
are all unknown, is there a way for me to find a set of solutions a_1,...,a_n
under the condition that corr(y,x_1), corr(y,x_2) .... corr(y,x_n)
are all greater than 0.7. For simplicity take correlation here as Pearson correlation. I know there would no be unique solution. But how can I construct a set of solutions for a_1,...,a_n
to fulfill this condition?
Spent a day to search the idea but could not get any information out of it. Any programming language to tackle this problem is welcomed or at least some reference for this.
No, it is not possible in general. It may be possible in some special cases.
Given x₁, x₂, ... you want to find y = a₁x₁ + a₂x₂ + ... + c so that all the correlations between y and the x's are greater than some target R. Since the correlation is
Corr(y, xi) = Cov(y, xi) / Sqrt[ Var(y) * Var(xi) ]
your constraint is
Cov(y, xi) / Sqrt[ Var(y) * Var(xi) ] > R
which can be rearranged to
Cov(y, xi)² > R² * Var(y) * Var(xi)
and this needs to be true for all i.
Consider the simple case where there are only two columns x₁ and x₂, and further assume that they both have mean zero (so you can ignore the constant c) and variance 1, and that they are uncorrelated. In that case y = a₁x₁ + a₂x₂ and the covariances and variances are
Cov(y, x₁) = a₁
Cov(y, x₂) = a₂
Var(x₁) = 1
Var(x₂) = 1
Var(y) = (a₁)² + (a₂)²
so you need to simultaneously satisfy
(a₁)² > R² * ((a₁)² + (a₂)²)
(a₂)² > R² * ((a₁)² + (a₂)²)
Adding these inequalities together, you get
(a₁)² + (a₂)² > 2 * R² * ((a₁)² + (a₂)²)
which means that in order to satisfy both of the inequalities, you must have R < Sqrt(1/2) (by cancelling common factors on both sides of the inequality). So the very best you could do in this simple case is to choose a₁ = a₂ (the exact value doesn't matter as long as they are equal) and both of the correlations Corr(y,a₁) and Corr(y,a₂) will be equal to 0.707. You cannot achieve correlations higher than this between y and all of the x's simultaneously in this case.
For the more general case with n
columns (each of which has mean zero, variance 1 and zero correlation between columns) you cannot simultaneously achieve correlations greater than 1 / sqrt(n)
(as pointed out in the comments by @kazemakase).
In general, the more independent variables there are, the lower the correlation you will be able to achieve between y and the x's. Also (although I haven't mentioned it above) the correlations between the x's matter. If they are in general positively correlated, you will be able to achieve a higher target correlation between y and the x's. If they are in general uncorrelated or negatively correlated, you will only be able to achieve low correlations between y and the x's.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With