 

Multiple Linear Regression with specific constraints on each coefficient in Python

I am currently running a multiple linear regression on a dataset. At first, I didn't realize I needed to put constraints on my weights; as a matter of fact, I need specific weights to be positive or negative.

To be more precise, I am building a scoring system, which is why some of my variables should have a positive or a negative impact on the score. Yet, when I run my model, the results do not match what I expect: some of my 'positive' variables get negative coefficients, and vice versa.

As an example, let's suppose my model is:

y = W0*x0 + W1*x1 + W2*x2

where x2 is a 'positive' variable, and I would like to constrain W2 to be positive!

I have been looking around a lot about this issue, but I have not found anything about constraints on specific weights/coefficients; all I have found is about setting all coefficients positive or making them sum to one.

I am working in Python with the scikit-learn package. This is how I get my best model:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

def ridge(Xtrain, Xtest, Ytrain, Ytest, position):
    # Tune the regularization strength with 3-fold cross-validation
    param_grid = {'alpha': [0.01, 0.1, 1, 10, 50, 100, 1000]}
    gs = GridSearchCV(Ridge(), param_grid=param_grid, n_jobs=-1, cv=3)
    gs.fit(Xtrain, Ytrain)
    hatytrain = gs.predict(Xtrain)
    hatytest = gs.predict(Xtest)
    return hatytrain, hatytest

Any idea how I could assign a constraint to the coefficient of a specific variable? It will probably be burdensome to define each constraint, but I have no idea how to do it otherwise.

asked May 18 '18 by Benjamin Salem


People also ask

Can you have covariates in multiple regression?

Introducing a covariate to a multiple regression model is very similar to conducting sequential multiple regression (sometimes called hierarchical multiple regression). In each of these situations, blocks are used to enter specific variables (be they predictors or covariates) into the model in chunks.

What does LinearRegression fit () do in Python?

Linear regression predicts a dependent variable value (y) from a given independent variable (x); that is, the technique finds a linear relationship between x (input) and y (output), hence the name linear regression. Calling fit() estimates the coefficients of that relationship from the training data.
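
For illustration, here is a minimal sketch of fit() with scikit-learn's LinearRegression (the toy data is made up):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y is roughly 2*x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.1, 4.9, 7.2])

model = LinearRegression()
model.fit(X, y)          # estimates intercept_ and coef_ from the data
print(model.intercept_)  # close to 1
print(model.coef_)       # close to [2]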


2 Answers

Scikit-learn does not allow such constraints on the coefficients.

But you can impose any constraints on the coefficients and optimize the loss with coordinate descent if you implement your own estimator. In the unconstrained case, coordinate descent produces the same result as OLS in a reasonable number of iterations.

I've written a class that imposes upper and lower bounds on LinearRegression coefficients. You can extend it to use a Ridge or even Lasso penalty if you want:

from sklearn.linear_model.base import LinearModel
from sklearn.base import RegressorMixin
from sklearn.utils import check_X_y
import numpy as np

class ConstrainedLinearRegression(LinearModel, RegressorMixin):

    def __init__(self, fit_intercept=True, normalize=False, copy_X=True, nonnegative=False, tol=1e-15):
        self.fit_intercept = fit_intercept
        self.normalize = normalize
        self.copy_X = copy_X
        self.nonnegative = nonnegative
        self.tol = tol

    def fit(self, X, y, min_coef=None, max_coef=None):
        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'], y_numeric=True, multi_output=False)
        X, y, X_offset, y_offset, X_scale = self._preprocess_data(
            X, y, fit_intercept=self.fit_intercept, normalize=self.normalize, copy=self.copy_X)
        self.min_coef_ = min_coef if min_coef is not None else np.repeat(-np.inf, X.shape[1])
        self.max_coef_ = max_coef if max_coef is not None else np.repeat(np.inf, X.shape[1])
        if self.nonnegative:
            self.min_coef_ = np.clip(self.min_coef_, 0, None)

        beta = np.zeros(X.shape[1]).astype(float)
        prev_beta = beta + 1
        hessian = np.dot(X.transpose(), X)
        # Coordinate descent: update one coefficient at a time, projecting each
        # update onto its [min_coef_, max_coef_] interval, until convergence.
        while not (np.abs(prev_beta - beta) < self.tol).all():
            prev_beta = beta.copy()
            for i in range(len(beta)):
                # Gradient of the squared-error loss at the current beta
                grad = np.dot(np.dot(X, beta) - y, X)
                # Exact minimizer along coordinate i, clipped to the box constraint
                beta[i] = np.minimum(self.max_coef_[i],
                                     np.maximum(self.min_coef_[i],
                                                beta[i] - grad[i] / hessian[i, i]))

        self.coef_ = beta
        self._set_intercept(X_offset, y_offset, X_scale)
        return self    

You can use this class, for example, to make all coefficients non-negative:

from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
X, y = load_boston(return_X_y=True)
model = ConstrainedLinearRegression(nonnegative=True)
model.fit(X, y)
print(model.intercept_)
print(model.coef_)

This produces an output like

-36.99292986145538
[0.         0.05286515 0.         4.12512386 0.         8.04017956
 0.         0.         0.         0.         0.         0.02273805
 0.        ]

You can see that most coefficients are zero. An ordinary LinearRegression would have made some of them negative:

model = LinearRegression()
model.fit(X, y)
print(model.intercept_)
print(model.coef_)

which would give you

36.49110328036191
[-1.07170557e-01  4.63952195e-02  2.08602395e-02  2.68856140e+00
 -1.77957587e+01  3.80475246e+00  7.51061703e-04 -1.47575880e+00
  3.05655038e-01 -1.23293463e-02 -9.53463555e-01  9.39251272e-03
 -5.25466633e-01]

You can also impose arbitrary bounds for any coefficients you choose - that's what you asked for. For example, in this setup

model = ConstrainedLinearRegression()
min_coef = np.repeat(-np.inf, X.shape[1])
min_coef[0] = 0
min_coef[4] = -1
max_coef = np.repeat(4, X.shape[1])
max_coef[3] = 2
model.fit(X, y, max_coef=max_coef, min_coef=min_coef)
print(model.intercept_)
print(model.coef_)

you would get an output

24.060175576410515
[ 0.          0.04504673 -0.0354073   2.         -1.          4.
 -0.01343263 -1.17231216  0.2183103  -0.01375266 -0.7747823   0.01122374
 -0.56678676]

Update. This solution can be adapted to work with constraints on linear combinations of the coefficients (e.g. their sum) - in this case, the individual constraints for each coefficient are recalculated on each step. This GitHub gist provides an example.
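
For intuition, here is a minimal sketch (not the author's gist) of how a sum constraint could be turned into a per-coefficient bound inside the coordinate-descent loop above; the name total_max is a made-up parameter for illustration:

import numpy as np

def upper_bound_for(i, beta, total_max):
    # Upper bound on beta[i] implied by the constraint sum(beta) <= total_max:
    # when coordinate descent updates beta[i], the other coefficients are held
    # fixed, so beta[i] can only use whatever budget they leave unspent.
    return total_max - (np.sum(beta) - beta[i])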

UPDATE Due to the popularity of this question, I have created a package with my implementation of constrained linear regression: https://github.com/avidale/constrained-linear-regression. You can install it with pip install constrained-linear-regression. Pull requests are welcome!

answered Oct 17 '22 by David Dale


In version 0.24.2 of scikit-learn, you can force the algorithm to use positive coefficients by passing the parameter positive=True to LinearRegression. By multiplying the columns for which you want a negative coefficient by -1, you should get what you want.
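
A minimal sketch of that trick, assuming X and y are NumPy arrays and neg_cols lists the columns whose coefficients should end up negative (both names are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

neg_cols = [4, 10]             # columns that should get negative coefficients
X_flipped = X.copy()
X_flipped[:, neg_cols] *= -1   # flip their sign so the desired effect becomes positive

model = LinearRegression(positive=True)  # requires scikit-learn >= 0.24
model.fit(X_flipped, y)

coef = model.coef_.copy()
coef[neg_cols] *= -1           # undo the flip to recover coefficients for the original X

Note that positive=True constrains every coefficient to be non-negative, so this trick works when each variable's expected sign is known in advance; for arbitrary per-coefficient bounds you still need a custom approach like the one in the first answer.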

answered Oct 17 '22 by Benjamin Breton