 

Getting rank deficient warning when using regress function in MATLAB

I have a dataset comprising 30 independent variables, and I tried performing linear regression in MATLAB R2010b using the regress function.

I get a warning stating that my matrix X is rank deficient to within machine precision.

Now, the coefficients I get after executing this function don't match the experimental ones.

Can anyone suggest how to perform the regression analysis for this equation, which comprises 30 variables?

asked Mar 20 '15 by Prav001

1 Answer

Following on from our discussion, the reason you are getting that warning is that you have what is known as an underdetermined system. Basically, you have more variables to solve for than you have equations (data points) to constrain them. One example of an underdetermined system is something like:

x + y + z = 1
x + y + 2z = 3

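You can check this directly in MATLAB. A small sketch, where the matrix A and vector b below simply encode the two equations above:

```matlab
% The 2-equation, 3-unknown system from above, in matrix form
A = [1 1 1; 1 1 2];   % coefficient matrix
b = [1; 3];           % right-hand side

% rank(A) = 2 < 3 unknowns, so there are infinitely many solutions
rank(A)

% Three different candidate solutions, all equally valid:
s1 = [1; -2; 2];
s2 = [2; -3; 2];
s3 = [3; -4; 2];
A*s1 - b              % each residual is exactly [0; 0]
A*s2 - b
A*s3 - b
```

Because the rank of A is smaller than the number of unknowns, no amount of algebra can single out one of these solutions as "the" answer.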
There are an infinite number of combinations of (x,y,z) that can solve the above system. For example, (x, y, z) = (1, −2, 2), (2, −3, 2), and (3, −4, 2). What rank deficient means in your case is that there is more than one set of regression coefficients that would satisfy the governing equation that would describe the relationship between your input variables and output observations. This is probably why the output of regress isn't matching up with your ground truth regression coefficients. Though it isn't the same answer, do know that the output is one possible answer. By running through regress with your data, this is what I get if I define your observation matrix to be X and your output vector to be Y:

>> format long g;
>> B = regress(Y, X);
>> B

B =

                         0
                         0
          28321.7264417536
                         0
          35241.9719076362
          899.386999172398
         -95491.6154990829
         -2879.96318251964
         -31375.7038251919
          5993.52959752106
                         0
          18312.6649115112
                         0
                         0
           8031.4391233753
          27923.2569044728
          7716.51932560781
         -13621.1638587172
          36721.8387047613
          80622.0849069525
         -114048.707780113
         -70838.6034825939
         -22843.7931997405
          5345.06937207617
                         0
          106542.307940305
         -14178.0346010715
         -20506.8096166108
         -2498.51437396558
           6783.3107243113

You can see that there are seven regression coefficients that are equal to 0, which corresponds to 30 - 23 = 7: we have 30 variables but only 23 constraints to work with. Be advised that this is not the only possible solution. regress essentially computes a least-squares solution, one that minimizes the sum of squared residuals of Y - X*B. This essentially simplifies to:

B = X^(*)*Y

X^(*) is what is known as the Moore-Penrose pseudo-inverse of the matrix. MATLAB has this available, and it is called pinv. Therefore, if we did:

B = pinv(X)*Y

We get:

B =

          44741.6923363563
           32972.479220139
         -31055.2846404536
         -22897.9685877566
          28888.7558524005
          1146.70695371731
         -4002.86163441217
           9161.6908044046
         -22704.9986509788
          5526.10730457192
          9161.69080479427
          2607.08283489226
          2591.21062004404
         -31631.9969765197
         -5357.85253691504
          6025.47661106009
          5519.89341411127
         -7356.00479046122
         -15411.5144034056
          49827.6984426955
         -26352.0537850382
         -11144.2988973666
         -14835.9087945295
         -121.889618144655
         -32355.2405829636
          53712.1245333841
         -1941.40823106236
         -10929.3953469692
         -3817.40117809984
          2732.64066796307

You see that there are no zero coefficients because pinv finds the minimum L2-norm solution, which promotes the "spreading out" of the coefficients across all of the variables (for lack of a better term). You can verify that these are correct regression coefficients by doing:

>> Y2 = X*B

Y2 =

      16.1491563400241
      16.1264219600856
       16.525331600049
      17.3170318001845
      16.7481541301999
      17.3266932502295
      16.5465094100486
      16.5184456100487
      16.8428701100165
      17.0749421099829
      16.7393450000517
      17.2993993099419
      17.3925811702017
      17.3347117202356
      17.3362798302375
      17.3184486799219
      17.1169638102517
      17.2813552099096
      16.8792925100727
      17.2557945601102
       17.501873690151
      17.6490477001922
      17.7733493802508

Similarly, if we use the regression coefficients from regress, so B = regress(Y, X); and then compute Y2 = X*B, we get:

Y2 =

      16.1491563399927
      16.1264219599996
      16.5253315999987
      17.3170317999969
      16.7481541299967
      17.3266932499992
      16.5465094099978
      16.5184456099983
      16.8428701099975
      17.0749421099985
      16.7393449999981
      17.2993993099983
      17.3925811699993
      17.3347117199991
      17.3362798299967
      17.3184486799987
      17.1169638100025
       17.281355209999
      16.8792925099983
      17.2557945599979
      17.5018736899983
      17.6490476999977
      17.7733493799981

There are some slight computational differences, which is to be expected. Similarly, we can also find the answer by using mldivide:

B = X \ Y

B =

                         0
                         0
           28321.726441712
                         0
          35241.9719075889
          899.386999170666
         -95491.6154989513
         -2879.96318251572
         -31375.7038251485
          5993.52959751295
                         0
          18312.6649114859
                         0
                         0
          8031.43912336425
          27923.2569044349
          7716.51932559712
         -13621.1638586983
          36721.8387047123
          80622.0849068411
         -114048.707779954
         -70838.6034824987
         -22843.7931997086
          5345.06937206919
                         0
          106542.307940158
         -14178.0346010521
         -20506.8096165825
         -2498.51437396236
          6783.31072430201

You can see that this curiously matches up with what regress gives you. That's because \ is a smarter operator. Depending on how your matrix is structured, it finds the solution to the system by a different method. I'd like to refer you to the post by Amro that talks about what algorithms mldivide uses when examining the properties of the input matrix being operated on:

How to implement Matlab's mldivide (a.k.a. the backslash operator "\")
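The zero pattern you saw in the regress and backslash outputs can be reproduced with synthetic data. The actual X and Y from the question are not shown here, so this sketch uses random data of the same shape (23 observations, 30 variables); for generic random data the backslash operator returns a "basic" solution with at most rank(X) nonzero entries:

```matlab
% Synthetic stand-in for the question's data: 23 observations, 30 variables
X = randn(23, 30);
Y = randn(23, 1);

B = X \ Y;        % basic least-squares solution

rank(X)           % 23 for generic random data of this shape
nnz(B == 0)       % 30 - 23 = 7 coefficients pinned exactly to zero
norm(X*B - Y)     % essentially 0: all 23 constraints are satisfied exactly
```

This is why exactly seven coefficients came out as zero above: the solver can only pin down as many coefficients as there are independent constraints, and the rest are set to zero.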


What you should take away from this answer is that you can certainly go ahead and use those regression coefficients, and they will more or less give you the expected output for each value of Y with each set of inputs for X. However, be warned that those coefficients are not unique. This is apparent from the fact that your ground truth coefficients don't match up with the output of regress: it isn't matching up because regress generated another answer that satisfies the constraints you provided.

There is more than one answer that can describe that relationship if you have an underdetermined system, as you have seen by my experiments shown above.
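The non-uniqueness can also be demonstrated constructively. Again using synthetic data of the same shape (the real X and Y are not reproduced here): any vector in the null space of X can be added to a valid solution to produce another, equally valid one.

```matlab
% Synthetic stand-in: 23 observations, 30 variables
X = randn(23, 30);
Y = randn(23, 1);

B1 = pinv(X) * Y;                     % minimum-norm solution
N  = null(X);                         % 30 - 23 = 7 null-space basis vectors
B2 = B1 + N * randn(size(N, 2), 1);   % a second, different solution

norm(X*B1 - Y)         % ~ 0
norm(X*B2 - Y)         % ~ 0 as well, even though B2 differs from B1
norm(B1) <= norm(B2)   % true: pinv's answer has the smallest norm
```

Any of these B vectors "explains" the data equally well, which is exactly why your experimentally determined coefficients need not agree with what regress returns.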

answered Nov 11 '22 by rayryeng