
How to know if a regression model generated by random forests is good? (MSE and %Var(y)) [closed]

I tried to use random forests for regression. The original data is a data frame with 218 rows and 9 columns. The first 8 columns are categorical (each value is A, B, C, or D), and the last column, V9, is numeric, with values ranging from 10.2 to 999.87.

When I fit a random forest on a training set (a randomly selected 2/3 of the original data), I got the following results.

>r=randomForest(V9~.,data=trainingData,mytree=4,ntree=1000,importance=TRUE,do.trace=100)
       |      Out-of-bag   |
  Tree |      MSE  %Var(y) |
   100 | 6.927e+04    98.98 |
   200 | 6.874e+04    98.22 |
   300 | 6.822e+04    97.48 |
   400 | 6.812e+04    97.34 |
   500 | 6.839e+04    97.73 |
   600 | 6.852e+04    97.92 |
   700 | 6.826e+04    97.54 |
   800 | 6.815e+04    97.39 |
   900 | 6.803e+04    97.21 |
  1000 | 6.796e+04    97.11 |

I do not know whether the high %Var(y) percentage means that the model is good or not. Also, since the MSE is high, I suspect that the regression model is not really good. Any idea how to read the results above? Do they mean that the model is not good?

John asked Dec 27 '22 03:12



1 Answer

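As @Joran noted, %Var(y) tells you how the model's error compares with the total variance of Y. In the do.trace output it is the out-of-bag MSE expressed as a percentage of Var(y): in your table, MSE divided by %Var(y)/100 is about 69,980 in every row. Values near 97% therefore mean the forest explains only about 3% of the variance out-of-bag, which is poor. After fitting, apply the model to your validation data (the remaining 1/3):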

RFestimated <- predict(r, newdata = ValidationData)
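
For completeness, here is a minimal sketch of the whole split, fit, and predict step, assuming the randomForest package is installed. The data are simulated stand-ins shaped like the data described in the question (not the real values), and the names originalData and trainIdx are hypothetical. Note that predict() for a randomForest object takes newdata; without it, the call returns the out-of-bag predictions for the training rows instead. The mytree argument from the question is left out because it is not a randomForest parameter (the package's parameter is mtry).

```r
library(randomForest)

set.seed(42)  # reproducible data, split, and forest

# Simulated stand-in data: 8 categorical predictors with levels A-D
# and a numeric response V9 (all values here are made up).
n <- 218
levels4 <- c("A", "B", "C", "D")
predictors <- lapply(1:8, function(i) {
  factor(sample(levels4, n, replace = TRUE), levels = levels4)
})
names(predictors) <- paste0("V", 1:8)
originalData <- data.frame(predictors)
originalData$V9 <- runif(n, min = 10.2, max = 999.87)

# Random 2/3 training, 1/3 validation split.
trainIdx <- sample(n, size = round(2 * n / 3))
trainingData <- originalData[trainIdx, ]
ValidationData <- originalData[-trainIdx, ]

r <- randomForest(V9 ~ ., data = trainingData, ntree = 1000,
                  importance = TRUE)

# Predictions for the held-out rows: one value per validation row.
RFestimated <- predict(r, newdata = ValidationData)

# Without newdata, predict() returns out-of-bag predictions for the
# training rows, so the lengths differ.
oobEstimated <- predict(r)
c(validation = length(RFestimated), training = length(oobEstimated))
```

Because the simulated V9 is pure noise, this forest should predict poorly; the point is only the mechanics of the split and the predict call.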

It is also worth checking the residuals:

qqnorm((RFestimated - ValidationData$V9)/sd(RFestimated-ValidationData$V9))

qqline((RFestimated-ValidationData$V9)/sd(RFestimated-ValidationData$V9))

the estimated versus observed values:

plot(ValidationData$V9, RFestimated)

and the RMSE:

RMSE <- sqrt(mean((RFestimated - ValidationData$V9)^2))
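
An RMSE on its own is hard to judge because it is in the units of V9. One way to put it in context, sketched below in base R, is to compare it with the RMSE of a trivial baseline that always predicts the training mean. The helper validation_summary and the toy numbers are hypothetical, for illustration only. The last lines apply the same ratio idea to the trace in the question, assuming %Var(y) is 100 * MSE / Var(y), which the nearly constant ratio between the two columns suggests.

```r
# Hypothetical helper: RMSE of the model, RMSE of a constant baseline,
# and the share of baseline squared error that the model removes.
validation_summary <- function(observed, predicted, baseline) {
  rmse <- sqrt(mean((predicted - observed)^2))
  baseline_rmse <- sqrt(mean((baseline - observed)^2))
  c(rmse = rmse,
    baseline_rmse = baseline_rmse,
    pseudo_r2 = 1 - (rmse / baseline_rmse)^2)
}

# Toy numbers, only to show the call.
obs <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 39)
res <- validation_summary(obs, pred, baseline = mean(obs))
res

# With the objects from this answer the call would be:
# validation_summary(ValidationData$V9, RFestimated,
#                    baseline = mean(trainingData$V9))

# Reading the OOB trace from the question (trees 100 and 1000).
oob_mse <- c(6.927e4, 6.796e4)
pct_var <- c(98.98, 97.11)
implied_var_y <- oob_mse / (pct_var / 100)  # nearly identical values
pct_explained <- 100 - pct_var              # rough % variance explained
implied_var_y
pct_explained
```

A pseudo_r2 near 0 (or negative) means the model is no better than predicting the mean; under the stated assumption, the trace in the question corresponds to only about 1% to 3% of the variance explained out-of-bag.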

I hope this helps!

Gorgens answered Jan 03 '23 09:01
