What does negative %IncMSE in RandomForest package mean?

I used randomForest for a regression problem. I used importance(rf,type=1) to get the %IncMSE for the variables, and one of them has a negative %IncMSE. Does this mean that this variable is bad for the model? I searched the Internet for answers but didn't find a clear one. I also found something strange in the model's summary (attached below): it seems that only one tree was used, although I defined ntree as 800.

model:

rf<-randomForest(var1~va2+var3+..+var35,data=d7depo,ntree=800,keep.forest=FALSE, importance=TRUE)

summary(rf)
                Length Class  Mode     
call                6  -none- call     
type                1  -none- character
predicted       26917  -none- numeric  
mse               800  -none- numeric  
rsq               800  -none- numeric  
oob.times       26917  -none- numeric  
importance         70  -none- numeric  
importanceSD       35  -none- numeric  
localImportance     0  -none- NULL     
proximity           0  -none- NULL     
ntree               1  -none- numeric  
mtry                1  -none- numeric  
forest              0  -none- NULL     
coefs               0  -none- NULL     
y               26917  -none- numeric  
test                0  -none- NULL     
inbag               0  -none- NULL     
terms               3  terms  call 
asked Jan 13 '15 09:01 by mql4beginner

1 Answer

Question 1 - why does ntree show 1?

summary(rf) shows you the length of each component stored in your rf object, not its value. The 1 next to ntree means that rf$ntree is a vector of length 1. If you type rf$ntree in your console, you will see that it shows 800.
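As a small illustration of the difference between length and value, here is a sketch that uses a plain list as a hypothetical stand-in for the fitted object (fake_rf is my own name, not something the package creates). A randomForest object is a list as well, so summary() treats it the same way:

```r
# Hypothetical stand-in for a fitted forest: a plain list with two
# components named like the ones in the question's summary output.
fake_rf <- list(ntree = 800, mse = numeric(800))

# summary() on a list reports Length/Class/Mode per component, not values.
print(summary(fake_rf))

print(length(fake_rf$ntree))  # length of the component, which summary() shows
print(fake_rf$ntree)          # the stored value, i.e. the number of trees
```

The same two calls on the real object, length(rf$ntree) and rf$ntree, should tell the two apart in the asker's session.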

Question 2 - does a negative %IncMSE show a "bad" variable?

%IncMSE:
The way this is calculated is by first computing the MSE of the whole model. Let's call this MSEmod. Then, for each variable (column in your data set), the values are randomly shuffled (permuted) so that a "bad" variable is created, and a new MSE is calculated. For example, imagine that one column had the values 1,2,3,4,5 in its rows. After the permutation these might end up as 4,3,1,2,5. All of the other columns remain exactly the same, since we want to examine col1's importance only.

Call the MSE after the permutation MSEcol1 (in a similar manner you will have MSEcol2, MSEcol3, but let's keep it simple and only deal with MSEcol1 here). Since the second MSE was computed with a completely random variable, we would expect MSEcol1 to be higher than MSEmod (the higher the MSE, the worse). Therefore, when we take the difference MSEcol1 - MSEmod we usually expect a positive number. In your case, a negative number shows that the shuffled variable worked better than the original one, which means that the variable is probably not predictive, i.e. not important.

Keep in mind that this is the high-level description. In the randomForest package the comparison is made per tree on the out-of-bag data, the differences are averaged over all trees, and (with the default scale=TRUE) that average is divided by the standard deviation of the differences. But the high-level story is this.
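To see the scaling step in practice, here is a minimal sketch. It assumes the randomForest package is installed and uses the built-in mtcars data as a stand-in for the asker's data set (the formula, ntree value and seed are arbitrary choices of mine). It compares importance() with scale=FALSE and with the default scale=TRUE; the scaled column should equal the raw one divided by rf$importanceSD, which the package documents as the "standard errors" of the permutation-based measure:

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only for reproducibility

# Stand-in data and model; not the asker's d7depo data set.
rf <- randomForest(mpg ~ ., data = mtcars, ntree = 200, importance = TRUE)

raw    <- importance(rf, type = 1, scale = FALSE)  # mean increase in OOB MSE
scaled <- importance(rf, type = 1, scale = TRUE)   # what the question reports

# Recompute the scaled values by hand from the raw ones.
by_hand <- raw[, 1] / rf$importanceSD

print(cbind(raw = raw[, 1], scaled = scaled[, 1], by_hand = by_hand))
```

Because the scaling only divides by a positive number, a variable with a negative scaled value also has a negative raw value.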

In algorithm form:

  1. Compute model MSE
  2. For each variable in the model:
    • Permute variable
    • Calculate new model MSE according to variable permutation
    • Take the difference: new model MSE minus model MSE
  3. Collect the results in a list
  4. Rank the variables' importance according to the value of the %IncMSE. The greater the value, the more important the variable
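The steps above can be sketched in a few lines of base R. This is only an illustration under my own assumptions, not what the package does internally: the data are simulated, lm() stands in for the forest so that no extra package is needed, and a held-out test set stands in for the out-of-bag samples. All names (d, fit, mse, inc_mse) are hypothetical:

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Simulated data (an assumption for illustration): x1 drives y, noise does not.
n <- 400
d <- data.frame(x1 = rnorm(n), noise = rnorm(n))
d$y <- 3 * d$x1 + rnorm(n)

train <- d[1:200, ]
test  <- d[201:400, ]

# lm() is a stand-in for the forest; the permutation logic is the same idea.
fit <- lm(y ~ x1 + noise, data = train)

mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)

# Step 1: baseline MSE on data the model has not seen.
mse_mod <- mse(fit, test)

# Step 2: permute one column at a time and recompute the MSE.
inc_mse <- sapply(c("x1", "noise"), function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])  # break the link between v and y
  mse(fit, shuffled) - mse_mod            # positive = permuting hurt the model
})

# Steps 3-4: collect and rank the increases.
print(sort(inc_mse, decreasing = TRUE))
```

The increase for x1 should be large and positive, while the one for noise should sit close to zero and can land on either side of it by chance. That is the situation a negative %IncMSE points to.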

Hope it is clear now!

answered Sep 21 '22 20:09 by LyzandeR