I am new to the random forest classifier. I am using it to classify a dataset with two classes.

- The number of features is 512.
- The class proportion is 3:1, i.e., 75% of the data belongs to the first class and 25% to the second.
- I am using 500 trees.
The classifier produces an out-of-bag (OOB) error of 21.52%. The per-class error for the first class (the one represented by 75% of the training data) is 0.0059, while the classification error for the second class is very high: 0.965.
I am looking for an explanation of this behaviour, and for any suggestions on how to improve the accuracy for the second class.
I am looking forward to your help.
Thanks
I forgot to say that I'm using R and that I used a nodesize of 1000 in the test above.
Here I repeated the training with only 10 trees and nodesize = 1 (just to give an idea); below is the output in R, including the confusion matrix:
```
               Type of random forest: classification
                     Number of trees: 10
No. of variables tried at each split: 22

        OOB estimate of  error rate: 24.46%
Confusion matrix:
           Irrelevant  Relevant  class.error
```
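For reference, a minimal sketch of the kind of call that produces output like the above. The data frame `train` and its factor column `label` (with levels "Irrelevant" and "Relevant") are assumptions for illustration, not names from the original post:

```r
library(randomForest)

# Hypothetical training data: `train` is a data frame whose column
# `label` is a factor with levels "Irrelevant" and "Relevant".
set.seed(42)
rf <- randomForest(label ~ ., data = train,
                   ntree    = 10,  # only 10 trees, as in the test above
                   nodesize = 1)   # minimum node size 1: trees grown deep

print(rf)      # prints the OOB error estimate and the confusion matrix
rf$confusion   # confusion matrix with per-class error as the last column
```

With a severe class imbalance, `rf$confusion` is usually more informative than the overall OOB rate, since a classifier that nearly always predicts the majority class can still show a low overall error.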
I agree with @usr that, generally speaking, when you see a Random Forest classify (nearly) every observation as the majority class, it means your features don't provide much information for distinguishing the two classes.
One option is to run the Random Forest such that you over-sample observations from the minority class (rather than sampling with replacement from the entire data set). So you might specify that each tree is built on a sample of size N where you force N/2 of the observations to come from each class (or some other ratio of your choosing).
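In the R `randomForest` package, this kind of balanced per-tree sampling can be requested through the `sampsize` (and `strata`) arguments. A sketch, reusing the hypothetical `train` data frame and `label` column from above:

```r
library(randomForest)

# Size of the minority class in the training data.
n_min <- min(table(train$label))

# Draw a balanced sample for each tree: n_min observations from each
# class, so individual trees no longer see a 3:1 majority.
rf_bal <- randomForest(label ~ ., data = train,
                       ntree    = 500,
                       strata   = train$label,
                       sampsize = c(n_min, n_min))

rf_bal$confusion   # compare the minority-class error with the original run
```

You can also use an unequal `sampsize` vector (e.g. `c(2 * n_min, n_min)`) if forcing a strict 50/50 split hurts the majority class too much.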
While that might help some, it is by no means a cure-all. You're more likely to get mileage out of finding better features that do a good job of distinguishing the classes than out of tweaking the RF settings.