
How to use classwt in randomForest in R?

I have a highly imbalanced data set with target class instances in the following ratio 60000:1000:1000:50 (i.e. a total of 4 classes). I want to use randomForest for making predictions of the target class.

So, to reduce the class imbalance, I experimented with the sampsize parameter, setting it to c(5000, 1000, 1000, 50) and some other values, but it did not help much. In fact, the accuracy for the 1st class decreased as I changed sampsize, while the improvement in the predictions for the other classes was very small.

While digging through the archives, I came across two more arguments of randomForest(), strata and classwt, which are used to offset class imbalance.

All the documents I found on classwt were old (generally from 2007 and 2008), and they all suggested not using the classwt feature of the randomForest package in R, because it does not implement the full functionality of the original Fortran code. So the first question is:
Is classwt completely implemented now in the randomForest package of R? If yes, what does passing c(1, 10, 10, 10) to the classwt argument represent? (Assuming the above case of 4 classes in the target variable.)

Another approach that is said to offset class imbalance is stratified sampling, which is always used in conjunction with sampsize. I understand what sampsize is from the documentation, but I could not find enough documentation or examples that give a clear insight into using strata to overcome class imbalance. So the second question is:
What type of arguments have to be passed to strata in randomForest, and what do they represent?

I guess the word weight which I have not explicitly mentioned in the question should play a major role in the answer.

StrikeR asked Nov 27 '13 19:11




1 Answer

classwt is correctly passed on to randomForest; check this example:

    library(randomForest)
    rf = randomForest(Species~., data = iris, classwt = c(1E-5,1E-5,1E5))
    rf

    #Call:
    # randomForest(formula = Species ~ ., data = iris, classwt = c(1e-05, 1e-05, 1e+05))
    #               Type of random forest: classification
    #                     Number of trees: 500
    #No. of variables tried at each split: 2
    #
    #        OOB estimate of  error rate: 66.67%
    #Confusion matrix:
    #           setosa versicolor virginica class.error
    #setosa          0          0        50           1
    #versicolor      0          0        50           1
    #virginica       0          0        50           0

Class weights are the priors on the outcomes. You need to balance them to achieve the results you want.
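To make the "priors" idea a bit more concrete, here is a minimal sketch with less extreme weights. The imbalanced subset of iris and the inverse-frequency weighting are my own assumptions for illustration, not something the package prescribes; the only point is that classwt takes one positive number per class, in the order of levels(y).

```r
library(randomForest)

set.seed(1)

# Hypothetical imbalanced data: 50 setosa, 50 versicolor, 10 virginica
imb <- iris[c(1:50, 51:100, 101:110), ]

# Assumed heuristic: weight each class by its inverse frequency,
# so the rare class gets the largest prior
freq <- table(imb$Species)
wts  <- as.numeric(sum(freq) / (nlevels(imb$Species) * freq))

# classwt is matched to the classes in the order of levels(imb$Species)
rf_w <- randomForest(Species ~ ., data = imb, classwt = wts)

# OOB confusion matrix, one row per class plus a class.error column
print(rf_w$confusion)
```

Comparing the confusion matrix with and without classwt on your own data is probably the quickest way to see whether a given set of weights moves the errors in the direction you want.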


On strata and sampsize this answer might be of help: https://stackoverflow.com/a/20151341/2874779

In general, a sampsize with the same size for all classes seems reasonable. strata is a factor that is used for stratified resampling; in your case you don't need to pass anything.

catastrophic-failure answered Oct 02 '22 18:10
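As a rough sketch of the equal-sampsize idea, assuming a small imbalanced subset of iris as stand-in data: each tree draws the same number of rows from every class. strata is passed explicitly here only to show what it looks like; as far as I understand the documentation, a vector-valued sampsize already stratifies on the response in classification.

```r
library(randomForest)

set.seed(42)

# Hypothetical imbalanced data: 50 setosa, 50 versicolor, 10 virginica
imb <- iris[c(1:50, 51:100, 101:110), ]

# Size of the smallest class; every class is sampled at this size
n_min <- min(table(imb$Species))

rf_bal <- randomForest(
  Species ~ ., data = imb,
  strata   = imb$Species,                       # factor to stratify on
  sampsize = rep(n_min, nlevels(imb$Species)),  # same draw size per class
  ntree    = 200
)

# OOB confusion matrix, one row per class plus a class.error column
print(rf_bal$confusion)
```

Each element of sampsize must not exceed the number of rows in the corresponding class, which is why the smallest class size is used as the common value here.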