
Unbalanced classification using RandomForestClassifier in sklearn

I have a dataset where the classes are unbalanced. The classes are either '1' or '0', and the ratio of class '1' to class '0' is 5:1. How do you calculate the prediction error for each class and then rebalance the weights accordingly in sklearn with Random Forest, similar to what is described at the following link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance

mlo asked Nov 19 '13 21:11





1 Answer

You can pass a sample_weight argument to the Random Forest fit method:

sample_weight : array-like, shape = [n_samples] or None 

Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, splits are also ignored if they would result in any single class carrying a negative weight in either child node.

Older versions had a preprocessing.balance_weights function that generated balancing weights for the given samples, such that the classes become uniformly distributed. It is still there, in the internal but still usable preprocessing._weights module, but it is deprecated and will be removed in future versions. I don't know the exact reasons for this.
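As a rough sketch of what such balancing weights look like, they can be computed by hand with NumPy alone. The helper name balanced_sample_weight is made up for illustration and is not a scikit-learn function. I believe this is the same heuristic that newer scikit-learn releases expose through class_weight="balanced" and sklearn.utils.class_weight.compute_sample_weight, but check the documentation for your version.

```python
import numpy as np

def balanced_sample_weight(y):
    """Hypothetical helper: weight each sample inversely to its class frequency."""
    y = np.asarray(y)
    # inverse[i] is the index into `classes` of the label y[i]
    classes, inverse, counts = np.unique(y, return_inverse=True, return_counts=True)
    # n_samples / (n_classes * class_count): every class ends up with the same total weight
    class_weight = len(y) / (len(classes) * counts)
    return class_weight[inverse]

# Toy labels with the 5:1 ratio of class 1 to class 0 from the question
y = np.array([1, 1, 1, 1, 1, 0])
w = balanced_sample_weight(y)
# Each class 1 sample gets weight 0.6 and the single class 0 sample gets 3.0,
# so both classes carry a total weight of 3.0
```

The resulting array can be passed as sample_weight to fit in the same way as any hand-written weights.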

Update

Some clarification, as you seem to be confused. Using sample_weight is straightforward once you remember that its purpose is to balance the target classes in the training dataset. That is, if you have X as observations and y as classes (labels), then len(X) == len(y) == len(sample_weight), and each element of the 1-d sample_weight array is the weight for the corresponding (observation, label) pair. In your case, since class 1 is represented 5 times as often as class 0 and you want to balance the class distributions, you could simply use

import numpy as np
sample_weight = np.array([5 if i == 0 else 1 for i in y])

This assigns a weight of 5 to all 0 instances and a weight of 1 to all 1 instances. See the balance_weights function mentioned above for a more sophisticated way of computing the weights.
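To tie this back to the question, here is a minimal end-to-end sketch: fit the forest with these weights, then read the per-class prediction error off a confusion matrix computed on held-out data. The synthetic dataset from make_classification, the train/test split, and the n_estimators and random_state values are all assumptions standing in for your own data and settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: roughly 1 part class 0 to 5 parts class 1
X, y = make_classification(n_samples=1200, weights=[1 / 6, 5 / 6], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Weight 5 for the minority class 0, weight 1 for the majority class 1
sample_weight = np.where(y_train == 0, 5, 1)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train, sample_weight=sample_weight)

# Rows of the confusion matrix are true classes, columns are predicted classes
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1])

# Error for class k = 1 - (correctly predicted k) / (all true k)
per_class_error = 1 - cm.diagonal() / cm.sum(axis=1)
print(dict(zip([0, 1], per_class_error)))
```

If the error for class 0 is still much higher than for class 1, you can raise its weight and refit, which is roughly the manual rebalancing loop described on the Breiman page.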

alko answered Oct 11 '22 13:10
