 

Why does regression in R delete index 1 of a factor variable? [duplicate]

I am trying to run a regression in R using the lm and glm functions.

My dependent variable is logit-transformed data based on the proportion of events over non-events within a given time period. So my dependent variable is continuous, whereas my independent variables are factor variables (dummies).

I have two independent variables that can take the values of

  • Year i to year m, my YEAR variable
  • Month j to month n, my MONTH variable

The problem is that whenever I run my model and look at the summary, April (index 1 for month) and 1998 (index 1 for year) are missing from the results. If I rename April to, say, "foo_bar", August goes missing instead.

Please help! This is frustrating me and I simply do not know how to search for a solution to the problem.

Kasper Christensen asked Dec 03 '25 05:12



1 Answer

If R were to create a dummy variable for every level in the factor, the resulting set of variables would be linearly dependent (assuming there is also an intercept term). Therefore, one factor level is chosen as the baseline and has no dummy generated for it.
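The linear dependence can be checked directly. The following is a minimal sketch using a hypothetical three-level factor f (the names f, dummies, X_full and X_default are made up for illustration). It builds one indicator column per level, adds an intercept column, and compares the rank of that matrix with the design matrix R builds by default.

```r
# Hypothetical three-level factor, two observations per level
f <- factor(c("a", "a", "b", "b", "c", "c"))

# One indicator column per level, no intercept: a 6 x 3 matrix
dummies <- model.matrix(~ f - 1)

# Put an intercept column of ones next to all three indicators
X_full <- cbind(intercept = 1, dummies)

# In every row the indicators sum to 1, i.e. to the intercept column,
# so the four columns are linearly dependent
rowSums(dummies)
ncol(X_full)      # number of columns
qr(X_full)$rank   # rank is lower than the number of columns

# The default design matrix: intercept plus dummies for the
# non-baseline levels only
X_default <- model.matrix(~ f)
colnames(X_default)
qr(X_default)$rank
```

Because the full set of indicators adds up to the intercept column, one of them carries no extra information, which is why one level is dropped.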

To illustrate this, let's consider a toy example:

> data <- data.frame(y=c(2, 3, 5, 7, 11, 25), f=as.factor(c('a', 'a', 'b', 'b', 'c', 'c')))
> summary(lm(y ~ f, data))

Call:
lm(formula = y ~ f, data = data)

Residuals:
   1    2    3    4    5    6 
-0.5  0.5 -1.0  1.0 -7.0  7.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    2.500      4.093   0.611   0.5845  
fb             3.500      5.788   0.605   0.5880  
fc            15.500      5.788   2.678   0.0752 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 5.788 on 3 degrees of freedom
Multiple R-squared: 0.7245, Adjusted R-squared: 0.5409 
F-statistic: 3.945 on 2 and 3 DF,  p-value: 0.1446 

As you can see, there are three coefficients (the same as the number of levels in the factor). Here, a has been chosen as the baseline, so (Intercept) refers to the subset of data where f is a. The coefficients for b and c (fb and fc) are the differences between the baseline intercept and the intercepts for the two other factor levels. Thus the intercept for b is 6 (2.500+3.500) and the intercept for c is 18 (2.500+15.500).
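One way to check this reading of the coefficients is to compare them with the plain per-level means of y. This is a small sketch on the same toy data; the variable names fit, b, group_means and from_coefs are made up for illustration.

```r
# Same toy data as in the example
data <- data.frame(y = c(2, 3, 5, 7, 11, 25),
                   f = as.factor(c("a", "a", "b", "b", "c", "c")))
fit <- lm(y ~ f, data)
b <- coef(fit)

# Raw mean of y within each level of f
group_means <- tapply(data$y, data$f, mean)

# Rebuild the same quantities from the coefficients:
# baseline intercept, then baseline plus each difference
from_coefs <- c(a = b[["(Intercept)"]],
                b = b[["(Intercept)"]] + b[["fb"]],
                c = b[["(Intercept)"]] + b[["fc"]])

group_means
from_coefs
all.equal(as.vector(group_means), as.vector(from_coefs))
```

With a single factor as the only predictor, the intercept plus each coefficient reproduces the mean of y for that level, so no information about the baseline level is lost.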

If you don't like the automatic choice, you can pick another level as the baseline; see "How to force R to use a specified factor level as reference in a regression?"
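As a rough sketch of that approach on the toy data, relevel() from the stats package moves a chosen level to the front, and the first level is the one treated as the baseline under the default contrasts. Factor levels built from character data are sorted alphabetically by default, which would explain why April comes first among month names and why August takes its place once April is renamed. The name fit_c is made up for illustration.

```r
# Same toy data as in the example
data <- data.frame(y = c(2, 3, 5, 7, 11, 25),
                   f = as.factor(c("a", "a", "b", "b", "c", "c")))

# Default level order is alphabetical, so "a" is the baseline
levels(data$f)

# Make "c" the reference level; the remaining levels keep their order
data$f <- relevel(data$f, ref = "c")
levels(data$f)

# Refit: the intercept now describes level "c", and the reported
# coefficients (fa and fb) are differences from "c"
fit_c <- lm(y ~ f, data)
coef(fit_c)
```

The fitted values are unchanged by this; only the level that the other coefficients are measured against is different.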

NPE answered Dec 04 '25 21:12
