
RF: high OOB accuracy by one class and very low accuracy by the other, with big class imbalance

I am new to the random forest classifier. I am using it to classify a dataset that has two classes:

- The number of features is 512.
- The classes are imbalanced: 75% of the data belongs to the first class and 25% to the second.
- I am using 500 trees.

The classifier produces an out-of-bag (OOB) error of 21.52%. The per-class error for the first class (which makes up 75% of the training data) is 0.0059, while the error for the second class is very high: 0.965.

I am looking for an explanation of this behaviour, and for any suggestions to improve the accuracy on the second class.

I am looking forward to your help. Thanks!

I forgot to say that I'm using R, and that I used a nodesize of 1000 in the test above.

Here I repeated the training with only 10 trees and nodesize = 1 (just to give an idea); below is the function call in R and the resulting confusion matrix:

```
> randomForest(formula = Label ~ ., data = chData30PixG12, ntree = 10,
               importance = TRUE, nodesize = 1, keep.forest = FALSE, do.trace = 50)
               Type of random forest: classification
                     Number of trees: 10
No. of variables tried at each split: 22

        OOB estimate of error rate: 24.46%
Confusion matrix:
           Irrelevant Relevant class.error
Irrelevant      37954     4510   0.1062076
Relevant         8775     3068   0.7409440
```
asked Nov 29 '22 by user1354770


1 Answer

I agree with @usr: generally speaking, when you see a Random Forest classify (nearly) every observation as the majority class, it means that your features don't provide much information to distinguish the two classes. (Note that always predicting the majority class on a 75/25 split already achieves 25% error, not far from your reported OOB error of 21.52%.)

One option is to run the Random Forest such that you over-sample observations from the minority class (rather than sampling with replacement from the entire data set). So you might specify that each tree is built on a sample of size N where you force N/2 of the observations to come from each class (or some other ratio of your choosing).
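In R's randomForest, this kind of balanced per-tree sampling is available through the `strata` and `sampsize` arguments. Here is a minimal sketch on synthetic data (the column name `Label` mirrors the question; the data itself, feature count, and tree count are made up for illustration, since `chData30PixG12` isn't reproducible here):

```r
library(randomForest)

set.seed(1)

## Synthetic stand-in for the question's data: 10 noisy features and a
## roughly 75/25 class imbalance.
n <- 2000
d <- as.data.frame(matrix(rnorm(n * 10), n, 10))
d$Label <- factor(ifelse(d$V1 + rnorm(n) > 1, "Relevant", "Irrelevant"))

n_min <- min(table(d$Label))  # size of the minority class

## Balanced per-tree sampling: for every tree, draw n_min observations
## from each class instead of bootstrapping the full (imbalanced) data.
rf <- randomForest(Label ~ ., data = d,
                   ntree    = 100,
                   strata   = d$Label,
                   sampsize = c(n_min, n_min))

print(rf$confusion)
```

With the default bootstrap, the minority-class error would typically be much higher than the majority-class error; forcing equal samples per class trades some majority-class accuracy for a more balanced confusion matrix.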

While that might help some, it is by no means a cure-all. You're more likely to get mileage out of finding better features that do a good job of distinguishing the classes than out of tweaking the RF settings.

answered Dec 01 '22 by joran