
Using LASSO in R with categorical variables

I've got a dataset with 1000 observations and 76 variables, about twenty of which are categorical. I want to use LASSO on this entire data set. I know that having factor variables doesn't really work in LASSO through either lars or glmnet, but the variables are too many and there are too many different, unordered values they can take on to reasonably recode them numerically.

Can LASSO be used in this situation? How do I do this? Creating a matrix of the predictors yields this response:

hdy<-as.numeric(housingData2[,75])
hdx<-as.matrix(housingData2[,-75])
model.lasso <- lars(hdx, hdy)
Error in one %*% x : requires numeric/complex matrix/vector arguments

I realize that other methods may be easier or more appropriate, but the challenge is actually to do this using lars or glmnet, so if it's possible, I would appreciate any ideas or feedback.

Thank you,

Alex asked Oct 21 '17 17:10


2 Answers

The other answers here point out ways to recode your categorical factors as dummies. Depending on your application, that may not be a great solution. If all you care about is prediction, then it is probably fine, and the approach provided by Flo.P should be okay. LASSO will find you a useful set of variables, and you probably won't overfit.

However, if you're interested in interpreting your model or discussing which factors are important after the fact, you're in a weird spot. The dummies produced by model.matrix's default coding have very specific interpretations when taken by themselves. model.matrix uses what is referred to as "dummy coding". (I remember learning it as "reference coding"; see here for a summary.) That means that if one of these dummies is included, your model now has a parameter whose interpretation is "the difference between one level of this factor and the reference level of that factor", where the reference level is chosen arbitrarily (by default, the first level). And maybe none of the other dummies for that factor were selected. You may also find that if the ordering of your factor levels changes, you end up with a different model.
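To make the point about reference levels concrete, here is a minimal base R sketch. The data frames toy and toy_releveled are hypothetical and not part of the question's data; the sketch only shows how the set of dummies depends on which level comes first.

```r
# Hypothetical toy factor with three levels; "a" is the reference by default.
toy <- data.frame(f = factor(c("a", "b", "c", "a", "b", "c")))
mm_default <- model.matrix(~ f, toy)
colnames(mm_default)
# "(Intercept)" "fb" "fc": each dummy is a difference from level "a"

# Make "c" the reference level instead: a different set of dummies appears.
toy_releveled <- transform(toy, f = relevel(f, ref = "c"))
mm_releveled <- model.matrix(~ f, toy_releveled)
colnames(mm_releveled)
# "(Intercept)" "fa" "fb": each dummy is now a difference from level "c"
```

Because LASSO penalizes each dummy separately, the two codings can lead to different selected variables even though they describe the same factor.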

There are ways to deal with this, but rather than kludge something together, I'd try the group lasso. Building on Flo.P's code:

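install.packages("gglasso")
library(gglasso)

# Helper: a random factor with nb_lvl levels and n observations
create_factor <- function(nb_lvl, n = 100) {
  factor(sample(letters[1:nb_lvl], n, replace = TRUE))
}

# Four 5-level factors, one numeric predictor, and a response
# that is unrelated to any of them
df <- data.frame(var1 = create_factor(5),
                 var2 = create_factor(5),
                 var3 = create_factor(5),
                 var4 = create_factor(5),
                 var5 = rnorm(100),
                 y = rnorm(100))

y <- df$y
# Dummy-code the predictors and drop the intercept column
x <- model.matrix( ~ ., dplyr::select(df, -y))[, -1]
# One group per factor (4 dummies each), plus a group for var5
groups <- c(rep(1:4, each = 4), 5)
fit <- gglasso(x = x, y = y, group = groups, lambda = 1)
fit$beta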

Since the simulated y has no relationship with our factors (var1, var2, etc.), the group LASSO does a good job: it sets all coefficients to 0 unless only a very small amount of regularization is applied. You can experiment with values of lambda (the tuning parameter), or omit the lambda argument and the function will pick a range of values for you.
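With 20 or so categorical variables, writing the group vector by hand gets error-prone. One possible sketch, in base R only, derives it from the "assign" attribute that model.matrix attaches to its result. The data frame toy and its columns f1, f2 and z are hypothetical, and the seed is only there to make the toy data reproducible.

```r
set.seed(1)  # hypothetical seed, only for reproducible toy data
toy <- data.frame(
  f1 = factor(sample(letters[1:5], 100, replace = TRUE), levels = letters[1:5]),
  f2 = factor(sample(letters[1:3], 100, replace = TRUE), levels = letters[1:3]),
  z  = rnorm(100)
)

mm <- model.matrix(~ ., toy)
# "assign" maps every column to its term: 0 = intercept, 1 = f1, 2 = f2, 3 = z
groups <- attr(mm, "assign")[-1]  # drop the intercept's entry
x <- mm[, -1]                     # drop the intercept column to match

groups
# 1 1 1 1 2 2 3: four dummies for f1, two for f2, one column for z
```

The resulting x and groups should be usable as the x and group arguments of gglasso in the same way as the hand-written vector above, since the labels come out as consecutive integers in column order.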

mavery answered Oct 23 '22 15:10


You can make dummy variables from your factors using model.matrix.

First, I create a data.frame; y is the target variable.

create_factor <- function(nb_lvl, n= 100 ){
  factor(sample(letters[1:nb_lvl],n, replace = TRUE))}

df <- data.frame(var1 = create_factor(5), 
           var2 = create_factor(5), 
           var3 = create_factor(5), 
           var4 = create_factor(5),
           var5 = rnorm(100),
           y = create_factor(2))


    # var1 var2 var3 var4        var5   y
    # 1    a    c    c    b -0.58655607 b
    # 2    d    a    e    a  0.52151994 a
    # 3    a    b    d    a -0.04792142 b
    # 4    d    a    a    d -0.41754957 b
    # 5    a    d    e    e -0.29887004 a

Select all the factor variables. I use dplyr::select_if, then paste the variable names together to get an expression like y ~ var1 + var2 + var3 + var4.

library(dplyr)
library(stringr)
library(glmnet)
vars_name <- df %>% 
  select(-y) %>% 
  select_if(is.factor) %>% 
  colnames() %>% 
  str_c(collapse = "+") 

model_string <- paste("y  ~",vars_name )

Create dummy variables with model.matrix. Don't forget as.formula to coerce the character string to a formula.

 x_train <- model.matrix(as.formula(model_string), df)[, -1]  # drop the intercept column; glmnet fits its own

Fit your model.

 lasso_model <- cv.glmnet(x=x_train,y = df$y, family = "binomial", alpha=1, nfolds=10)

The code could be simplified, but the idea is there.
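As one possible simplification, the formula y ~ . lets model.matrix handle factors and numeric columns together, so no formula string has to be built and numeric predictors such as var5 stay in the model. This is a sketch on a hypothetical, smaller data frame called toy; the seed is an assumption for reproducibility, and the fit is guarded so the snippet still runs when glmnet is not installed.

```r
set.seed(42)  # hypothetical seed, only for reproducible toy data
toy <- data.frame(
  var1 = factor(sample(letters[1:5], 100, replace = TRUE), levels = letters[1:5]),
  var5 = rnorm(100),
  y    = factor(sample(c("a", "b"), 100, replace = TRUE), levels = c("a", "b"))
)

# y ~ . expands to every column except y; [, -1] drops the intercept column,
# because glmnet fits its own intercept.
x_train <- model.matrix(y ~ ., toy)[, -1]
colnames(x_train)
# "var1b" "var1c" "var1d" "var1e" "var5"

# Fit only when glmnet is available.
if (requireNamespace("glmnet", quietly = TRUE)) {
  lasso_model <- glmnet::cv.glmnet(x = x_train, y = toy$y,
                                   family = "binomial", alpha = 1, nfolds = 10)
  # Coefficients at the lambda with the lowest cross-validated error
  print(coef(lasso_model, s = "lambda.min"))
}
```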

Flo.P answered Oct 23 '22 15:10