Suppose I want to run a regression using lm with a factor as a right-hand-side variable. What is the best way to choose which level of the factor is the base category (the one that is excluded to avoid multicollinearity)? Note that I am not interested in excluding the intercept, because I have many factors.
I would also like a formula-based solution, not one that acts on the data.frame directly, although if you think you have a really good solution for that, please post it as well.
My solution is:
base_cat <- function(x, n = 100) c(x, setdiff(seq_len(n), x))  # put level x first, keep the rest in order
a_reg <- lm(y ~ x1 + x2 + factor(x3, levels = base_cat(30)))  # suppose that x3 has draws from the integers 1 to 100
The category left out by lm is the first level of the factor, so this just reorders the levels so that the one specified in base_cat() comes first, with the rest after it.
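To see the effect of that reordering, here is a minimal sketch on simulated data. The data frame d, the five-level x3, and the choice of level 3 as the base are all made up for illustration; they are not from the question.

```r
set.seed(1)  # simulated data, purely illustrative
d <- data.frame(x3 = sample(1:5, 200, replace = TRUE))
d$y <- rnorm(200) + d$x3

# Put level 3 first; setdiff() keeps the remaining levels in their original order
lv <- c(3, setdiff(1:5, 3))
fit <- lm(y ~ factor(x3, levels = lv), data = d)

# There is no coefficient for level 3: it is absorbed into the intercept
names(coef(fit))
```

The remaining coefficients are then differences relative to level 3.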
Any other ideas?
We can check whether a variable is a factor using the class() function. Similarly, the levels of a factor can be inspected with the levels() function.
To set the reference level of a factor manually in R, use the relevel() function. relevel() reorders the levels of a factor so that the level specified by the user comes first and the others are moved down, keeping their relative order.
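A small sketch of these three functions on a toy factor (the vector f and its labels are invented for the example):

```r
f <- factor(c("low", "high", "mid", "low"))

class(f)    # "factor"
levels(f)   # character levels are sorted by default: "high" "low" "mid"

# Move "mid" to the front; the other levels keep their relative order
f2 <- relevel(f, ref = "mid")
levels(f2)  # "mid" "high" "low"
```

Only the level order changes; the underlying values of the factor are untouched.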
When building a linear or logistic regression model, you should consider including: Variables that are already proven in the literature to be related to the outcome. Variables that can either be considered the cause of the exposure, the outcome, or both. Interaction terms of variables that have large main effects.
The function relevel does precisely this. You pass it an unordered factor and the name of the reference level, and it returns a factor with that level as the first one.
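Since relevel() returns a factor, it can be called directly inside the formula, which keeps the data frame untouched. A minimal sketch, assuming simulated data (d, x1, x3 and the base level "7" are illustrative only):

```r
set.seed(42)  # simulated data, purely illustrative
d <- data.frame(x1 = rnorm(300),
                x3 = sample(1:10, 300, replace = TRUE))
d$y <- 2 * d$x1 + rnorm(300)

# Pass the reference level as a character label so it is matched against
# the level names rather than interpreted as a position
fit <- lm(y ~ x1 + relevel(factor(x3), ref = "7"), data = d)

# Level 7 has no coefficient of its own; it is the base category
names(coef(fit))
```

Because x3 is numeric here, it is wrapped in factor() first; if the column is already a factor, relevel(x3, ref = "7") alone should be enough.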