I would like to predict multiple dependent variables using multiple predictors. If I understood correctly, in principle one could make a bunch of linear regression models that each predict one dependent variable, but if the dependent variables are correlated, it makes more sense to use multivariate regression. I would like to do the latter, but I'm not sure how.
So far I haven't found a Python package that specifically supports this. I've tried scikit-learn, and even though their linear regression model example only shows the case where y is an array (one dependent variable per observation), it seems to be able to handle multiple y. But when I compare the output of this "multivariate" method to the results I get by manually looping over each dependent variable and predicting them independently from each other, the outcome is exactly the same. I don't think this should be the case, because there is a strong correlation between some of the dependent variables (>0.5).
The code just looks like this, with y either an n x 1 matrix or an n x m matrix, and x and newx matrices of various sizes (number of rows in x == n).
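from sklearn import linear_model

ols = linear_model.LinearRegression()
ols.fit(x, y)  # y is either n x 1 or n x m
ols.predict(newx)  # one column of predictions per dependent variable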
Does this function actually perform multivariate regression?
This is a mathematical/stats question, but I will try to answer it here anyway.
The outcome you see is absolutely expected. A linear model like this won't take correlation between dependent variables into account.
If you had only one dependent variable, your model would essentially consist of a weight vector
w_0 w_1 ... w_n,
where n is the number of features. With m dependent variables, you instead have a weight matrix
w_10 w_11 ... w_1n
w_20 w_21 ... w_2n
...
w_m0 w_m1 ... w_mn
But the weights for different output variables (1, ..., m) are completely independent of each other. Since the total sum of squared errors splits into a sum of squared errors over each output variable, minimizing the total squared loss is exactly the same as setting up one univariate linear model per output variable and minimizing their squared losses independently.
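The argument above can be checked numerically. The following is a minimal sketch, not the scikit-learn implementation: it uses NumPy's least-squares solver on made-up synthetic data (the sizes, the seed, and the names x1, w_joint and w_separate are assumptions for illustration), fits all outputs at once, then fits each output on its own, and compares the weights. Note that the weight matrix here is stored with one column per output, i.e. transposed relative to the layout written out above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_outputs = 200, 3, 2

# Hypothetical synthetic data: both outputs share one noise term,
# so they remain strongly correlated even after accounting for x.
x = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=(n_features, n_outputs))
shared_noise = rng.normal(size=(n_samples, 1))
y = x @ true_w + shared_noise + 0.1 * rng.normal(size=(n_samples, n_outputs))

# Prepend a column of ones so the first weight plays the role of w_0.
x1 = np.column_stack([np.ones(n_samples), x])

# Joint fit: a single least-squares problem with an n x m target matrix.
w_joint, *_ = np.linalg.lstsq(x1, y, rcond=None)

# Separate fits: one least-squares problem per output column.
w_separate = np.column_stack([
    np.linalg.lstsq(x1, y[:, j], rcond=None)[0]
    for j in range(n_outputs)
])

# Correlation between the residuals of the two outputs.
residual_corr = np.corrcoef((y - x1 @ w_joint).T)[0, 1]
same_weights = np.allclose(w_joint, w_separate)

print("residual correlation between outputs:", round(residual_corr, 2))
print("max weight difference:", np.abs(w_joint - w_separate).max())
print("identical weights (up to rounding):", same_weights)
```

Even though the outputs are strongly correlated, the joint and per-output weights should agree up to floating-point rounding, which matches what the question observed with LinearRegression.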