
randomForest() machine learning in R

I am exploring the function randomForest() in R, and several articles I found all suggest using logic similar to the code below, where the response variable is column 30 and the independent variables are everything except column 30:

dat.rf <- randomForest(dat[,-30], 
                      dat[,30], 
                      proximity=TRUE, 
                      mtry=3,
                      importance=TRUE,
                      do.trace=100,
                      na.action = na.omit)

When I try this, I get the following error and warning messages:

Error in randomForest.default(dat[, -30], dat[, 30], proximity = TRUE, :
  NA not permitted in predictors
In addition: Warning message:
In randomForest.default(dat[, -30], dat[, 30], proximity = TRUE, :
  The response has five or fewer unique values. Are you sure you want to do regression?

However, I was able to get it to work when I listed the independent variables one by one while keeping all the other parameters the same.

dat.rf <- randomForest(as.factor(Y) ~X1+ X2+ X3+ X4+ X5+ X6+ X7+ X8+ X9+ X10+......,                          
                      data=dat,
                      proximity=TRUE,
                      mtry=3,
                      importance=TRUE,
                      do.trace=100,
                      na.action = na.omit)

Could someone help me debug the simpler command, where I don't have to list each predictor one by one?

user3521568 asked Feb 06 '26 14:02


1 Answer

The error and the warning each give you a clue to a separate problem:

  1. First, you need to remove any row that has an NA anywhere. The na.action argument is only used by the formula interface, which is why na.omit had no effect in your first call. Removing NA should be easy enough and I'll leave you that one as an exercise.
  2. It looks like you need to do classification (which predicts a response that has only a few discrete levels), rather than regression (which predicts a continuous response). If the response is not a factor, randomForest() assumes regression, which is what the warning is about.
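
For the first point, one common way to drop incomplete rows is complete.cases() from base R. The sketch below uses a small made-up data frame as a stand-in for dat (the column names and values are hypothetical, not from the question); na.omit(dat) would give the same rows.

```r
# Hypothetical stand-in for `dat`: a small data frame with a few NAs.
dat <- data.frame(
  X1 = c(1.2, NA, 3.1, 4.8, 5.0, 6.3),
  X2 = c(10, 20, 30, NA, 50, 60),
  Y  = c(0, 1, 0, 1, 1, 0)
)

# Keep only the rows that have no NA in any column.
dat_complete <- dat[complete.cases(dat), ]

nrow(dat)            # 6 rows before
nrow(dat_complete)   # 4 rows after (rows 2 and 4 are dropped)
anyNA(dat_complete)  # FALSE
```

Filtering the whole data frame before splitting it into predictors and response keeps the two aligned row by row.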

So, how do you force randomForest() to use classification? As you noticed in your first try, randomForest() accepts the predictors and the response as separate arguments, not just the formula style. To force randomForest() to apply classification, make sure that the value you are trying to predict (the response, or dat[,30]) is a factor. Remember to explicitly name the x and y arguments. This is easy to do:

 randomForest(x = dat[,-30],
              y = factor(dat[,30]),
              ...)

This way your output can only take one of the levels given in y.
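
As a rough end-to-end sketch of the x/y style, the example below builds a stand-in for dat from the built-in iris data, with a numeric 0/1 response in the last column. The data, the resp index and the argument values are illustrative assumptions, not the asker's actual setup.

```r
library(randomForest)

set.seed(42)  # only so the illustration is repeatable

# Hypothetical stand-in for `dat`: four numeric predictors plus a
# numeric 0/1 response in the last column (column 30 in the question).
dat <- iris[, 1:4]
dat$Y <- as.integer(iris$Species == "virginica")
resp <- ncol(dat)  # index of the response column

# Wrapping the response in factor() is what selects classification.
fit <- randomForest(x = dat[, -resp],
                    y = factor(dat[, resp]),
                    mtry = 3,
                    importance = TRUE,
                    proximity = TRUE)

fit$type               # "classification"
levels(fit$predicted)  # "0" "1"
```

Printing fit should then show a confusion matrix rather than a mean squared error, which is a quick way to confirm that classification was used.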

This is all buried in the description of the arguments x and y: see ?randomForest.

Andy Clifton answered Feb 09 '26 06:02


