
Stata Not Dropping Variables (in regression) due to Multicollinearity and I think it should

I am running a simple regression of race times against temperature just to develop some basic intuition. My dataset is very large; each observation is the race completion time of one unit in a given race in a given year.

For starters I am running a very simple regression of race time on temperature bins.

Summary of temp variable:

    Variable     |     Obs    Mean   Std. Dev.   Min    Max
    -------------+-----------------------------------------
    avg_temp_scc | 8309434    54.3        9.4      0     89

Summary of time variable:

    Variable     |     Obs    Mean   Std. Dev.   Min    Max
    -------------+-----------------------------------------
    chiptime     | 8309434   267.5       59.6    122   1262

I decided to make 10 degree bins for temperature and regress time against those.

The code is:

    egen temp_trial = cut(avg_temp_scc), at(0,10,20,30,40,50,60,70,80,90)
    reg chiptime i.temp_trial

The output is

      Source |       SS         df       MS              Number of obs = 8309434
    ---------+--------------------------------           F(  8,8309425) =69509.83
       Model |  1.8525e+09       8   231557659           Prob > F      =  0.0000
    Residual |  2.7681e+10 8309425  3331.29368           R-squared     =  0.0627
    ---------+--------------------------------           Adj R-squared =  0.0627
       Total |  2.9534e+10 8309433  3554.22521           Root MSE      =  57.717

        chiptime |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
      temp_trial |
              10 |  -26.63549   2.673903    -9.96   0.000    -31.87625   -21.39474
              20 |   10.23883   1.796236     5.70   0.000      6.71827    13.75939
              30 |   -16.1049   1.678432    -9.60   0.000    -19.39457   -12.81523
              40 |  -13.97918   1.675669    -8.34   0.000    -17.26343   -10.69493
              50 |  -10.18371   1.675546    -6.08   0.000    -13.46772   -6.899695
              60 |  -.6865365   1.675901    -0.41   0.682    -3.971243     2.59817
              70 |   44.42869   1.676883    26.49   0.000     41.14206    47.71532
              80 |   23.63064   1.766566    13.38   0.000     20.16824    27.09305
           _cons |   273.1366   1.675256   163.04   0.000     269.8531      276.42

So Stata correctly drops one of the temperature bins (in this case 0-10) to serve as the base level.

Now I manually created the bins and ran the regression again:

    gen temp0 = 1 if temp_trial==0
    replace temp0 = 0 if temp_trial!=0

    gen temp1 = 1 if temp_trial == 10
    replace temp1 = 0 if temp_trial != 10

    gen temp2 = 1 if temp_trial==20
    replace temp2 = 0 if temp_trial!=20

    gen temp3 = 1 if temp_trial==30
    replace temp3 = 0 if temp_trial!=30

    gen temp4=1 if temp_trial==40
    replace temp4=0 if temp_trial!=40

    gen temp5=1 if temp_trial==50
    replace temp5=0 if temp_trial!=50

    gen temp6=1 if temp_trial==60
    replace temp6=0 if temp_trial!=60

    gen temp7=1 if temp_trial==70
    replace temp7=0 if temp_trial!=70

    gen temp8=1 if temp_trial==80
    replace temp8=0 if temp_trial!=80

    reg chiptime temp0 temp1 temp2 temp3 temp4 temp5 temp6 temp7 temp8

The output is:

      Source |       SS         df       MS              Number of obs = 8309434
    ---------+--------------------------------           F(  9,8309424) =61786.51
       Model |  1.8525e+09       9   205829030           Prob > F      =  0.0000
    Residual |  2.7681e+10 8309424  3331.29408           R-squared     =  0.0627
    ---------+--------------------------------           Adj R-squared =  0.0627
       Total |  2.9534e+10 8309433  3554.22521           Root MSE      =  57.717


--------------------------------------------------------------------------
chiptime |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
   temp0 |  -54.13245   6050.204    -0.01   0.993    -11912.32    11804.05
   temp1 |  -80.76794   6050.204    -0.01   0.989    -11938.95    11777.42
   temp2 |  -43.89362   6050.203    -0.01   0.994    -11902.08    11814.29
   temp3 |  -70.23735   6050.203    -0.01   0.991    -11928.42    11787.94
   temp4 |  -68.11162   6050.203    -0.01   0.991    -11926.29    11790.07
   temp5 |  -64.31615   6050.203    -0.01   0.992     -11922.5    11793.87
   temp6 |  -54.81898   6050.203    -0.01   0.993       -11913    11803.36
   temp7 |  -9.703755   6050.203    -0.00   0.999    -11867.89    11848.48
   temp8 |   -30.5018   6050.203    -0.01   0.996    -11888.68    11827.68
   _cons |    327.269   6050.203     0.05   0.957    -11530.91    12185.45

Note that the bins are exhaustive of the entire dataset, Stata is including a constant in the regression, and yet none of the bins is dropped. Is this not incorrect? Given that the constant is included, shouldn't one of the bins be dropped to make it the "base case"? I feel as though I am missing something obvious here.

Edit: Here is a Dropbox link for the data and do-file: It contains only the two variables under consideration. The file is 129 MB. I also have a picture of my output at the link.

asked Dec 07 '25 22:12 by user52932


1 Answer

This too is not an answer, but an extended comment, since I'm tired of fighting with the 600-character limit and the freeze on editing after 5 minutes.

In the comment thread on the original post, @user52932 wrote

> Thank you for verifying this. Can you elaborate on what exactly this precision issue is? Does this only cause problems in this multicollinearity issue? Could it be that when I am using factor variables this precision issue may cause my estimates to be wrong?

I want to be unambiguous that the results from the regression using factor variables are as correct as those of any well-specified regression can be.

In the regression using dummy variables, the model was misspecified to include a set of multicollinear variables. Stata is then faulted for failing to detect the multicollinearity.

But there's no magic test for multicollinearity. It's inferred from characteristics of the cross-products matrix. In this case the cross-products matrix represents 8.3 million observations, and despite Stata's use of double-precision throughout, the calculated matrix passed Stata's test and was not detected as containing a multicollinear set of variables. This is the locus of the precision problem to which I referred. Note that by reordering the observations, the accumulated cross-products matrix differed enough so that it now failed Stata's test, and the misspecification was detected.
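In this particular case the linear dependence is exact in the data, so it can be checked directly instead of being inferred from the cross-products matrix. The following is only a sketch, assuming the dummies `temp0` through `temp8` exist exactly as created in the question; `temp_sum` is a hypothetical helper name that is not part of the original do-file.

```stata
* Assumes temp0-temp8 were generated as in the question.
* temp_sum is a hypothetical helper variable used only for this check.
egen byte temp_sum = rowtotal(temp0 temp1 temp2 temp3 temp4 temp5 temp6 temp7 temp8)

* If every observation falls in exactly one bin, the dummies sum to 1
* in every row, which is the same column as the constant term.
assert temp_sum == 1

drop temp_sum
```

If the `assert` passes without complaint, the nine dummies add up to the constant, which is the multicollinear set described above.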

Now look at the results in the original post obtained from this misspecified regression. Note that if you add 54.13245 to the coefficients on each of the dummy variables and subtract the same amount from the constant, the resulting coefficients and constant are identical to those in the regression using factor variables. This is the textbook definition of the problem with multicollinearity - not that the coefficient estimates are wrong, but that the coefficient estimates are not uniquely defined.
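The reason the shift works is that the nine dummies sum to 1 for every observation, so adding any amount to each dummy coefficient and subtracting it from the constant leaves the fitted values unchanged. As a rough check of the arithmetic, the lines below use coefficients copied from the two outputs in the question; the scalar name `shift` is my own choice.

```stata
* Shift taken from the temp0 coefficient in the dummy-variable output
scalar shift = 54.13245

* Dummy coefficient + shift; compare with the factor-variable
* coefficients -26.63549 (level 10), 10.23883 (level 20), 44.42869 (level 70)
display -80.76794 + scalar(shift)
display -43.89362 + scalar(shift)
display -9.703755 + scalar(shift)

* Constant - shift; compare with the factor-variable constant 273.1366
display 327.269 - scalar(shift)
```

The results should agree with the factor-variable output up to the rounding of the displayed coefficients.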

In a comment above, @user52932 wrote

> I am unsure what Stata is using as the base case in my data.

The answer is that Stata used no base case; the results are what are to be expected when a set of multicollinear variables is included among the independent variables.

So this question is a reminder to us that statistical packages like Stata cannot infallibly detect multicollinearity. As it turns out, that's part of the genius of factor variable notation, I realize now. With factor variable notation, you tell Stata to create a set of dummy variables that by definition will be multicollinear, and since it understands that relationship between the dummy variables, it can eliminate the multicollinearity ex ante, before constructing the cross-products matrix, rather than attempt to infer the problem ex post, using the cross-products matrix's characteristics.
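For completeness, here is a sketch of how factor variable notation lets you state the base case explicitly, using the variable names from the question. The choice of 50 as an alternative base level is only an illustration.

```stata
* Let Stata pick the base level (the lowest value, 0, by default)
regress chiptime i.temp_trial

* Choose a different base level explicitly, here the bin starting at 50
regress chiptime ib50.temp_trial

* Keep all nine bins and drop the constant instead of a bin
regress chiptime ibn.temp_trial, noconstant
```

In the last form each coefficient is the mean of `chiptime` within its bin, and no base level is needed because the constant is omitted. In all three forms the collinearity is resolved by specification and never has to be detected numerically.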

We should not be surprised that Stata occasionally fails to detect multicollinearity, but rather gratified that it does as well as it does at doing so. After all, the second model is indeed a misspecification, which constitutes an unambiguous violation of the assumptions of OLS regression on the user's part.

