 

Does scikit-learn perform "real" multivariate regression (multiple dependent variables)?

I would like to predict multiple dependent variables using multiple predictors. If I understood correctly, in principle one could make a bunch of linear regression models that each predict one dependent variable, but if the dependent variables are correlated, it makes more sense to use multivariate regression. I would like to do the latter, but I'm not sure how.

So far I haven't found a Python package that specifically supports this. I've tried scikit-learn, and even though their linear regression model example only shows the case where y is an array (one dependent variable per observation), it seems to be able to handle multiple y. But when I compare the output of this "multivariate" method to the results I get by manually looping over each dependent variable and predicting them independently from each other, the outcome is exactly the same. I don't think this should be the case, because there is a strong correlation between some of the dependent variables (>0.5).

The code just looks like this, with y either an n x 1 or an n x m matrix, and x and newx matrices of compatible sizes (x has n rows).

from sklearn import linear_model

ols = linear_model.LinearRegression()
ols.fit(x, y)         # y may be shape (n,) or (n, m)
ols.predict(newx)
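The comparison described above can be sketched as follows (a minimal example with made-up random data; the sizes n, p, and m are arbitrary):

```python
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
n, p, m = 100, 3, 2            # observations, features, dependent variables
x = rng.randn(n, p)
y = rng.randn(n, m)            # m dependent variables per observation
newx = rng.randn(5, p)

# Fit all dependent variables at once (y is n x m)
ols = linear_model.LinearRegression()
ols.fit(x, y)
pred_joint = ols.predict(newx)          # shape (5, m)

# Manually loop over each dependent variable and fit it independently
pred_separate = np.column_stack([
    linear_model.LinearRegression().fit(x, y[:, j]).predict(newx)
    for j in range(m)
])

# The two predictions agree (up to floating-point error)
print(np.allclose(pred_joint, pred_separate))
```

This reproduces the observation in the question: the "multivariate" fit and the column-by-column loop give the same predictions.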

Does this function actually perform multivariate regression?

CSquare Avatar asked May 26 '15 15:05



1 Answer

This is a mathematical/stats question, but I will try to answer it here anyway.

The outcome you see is absolutely expected. An ordinary linear model like this does not take correlation between the dependent variables into account.

If you had only one dependent variable, your model would essentially consist of a weight vector

w_0  w_1  ...  w_n,

where n is the number of features. With m dependent variables, you instead have a weight matrix

w_10  w_11  ...  w_1n
w_20  w_21  ...  w_2n
....             ....
w_m0  w_m1  ...  w_mn

But the weights for the different output variables (1, ..., m) are completely independent of each other. Since the total sum of squared errors decomposes into a sum of squared errors over each output variable, minimizing the total squared loss is exactly equivalent to setting up one univariate linear model per output variable and minimizing their squared losses independently.
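This row-wise independence of the weight matrix can be checked directly: even when the outputs are strongly correlated by construction, each row of the jointly fitted weight matrix matches the coefficients of a separate univariate fit. A small sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
X = rng.randn(200, 4)

# Construct two strongly correlated outputs: y2 is mostly a copy of y1
y1 = X @ rng.randn(4) + 0.1 * rng.randn(200)
y2 = 0.8 * y1 + 0.1 * rng.randn(200)
Y = np.column_stack([y1, y2])

# Joint fit: coef_ is the weight matrix, one row per output variable
joint = LinearRegression().fit(X, Y)

# Separate univariate fits, one per output
row0 = LinearRegression().fit(X, y1).coef_
row1 = LinearRegression().fit(X, y2).coef_

# Each row of the joint weight matrix equals the corresponding univariate fit
print(np.allclose(joint.coef_[0], row0))
print(np.allclose(joint.coef_[1], row1))
```

The correlation between y1 and y2 has no effect on the fitted weights, which is exactly why the question's loop and the multi-output fit produce identical results.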

cfh Avatar answered Oct 19 '22 22:10
