Balanced Random Forest in scikit-learn (python)

I'm wondering if there is an implementation of the Balanced Random Forest (BRF) in recent versions of the scikit-learn package. BRF is used in the case of imbalanced data. It works like a normal RF, but for each bootstrap iteration it balances the classes by undersampling. For example, given two classes with N0 = 100 and N1 = 30 instances, at each random sampling it draws (with replacement) 30 instances from the first class and the same number of instances from the second class, i.e. it trains each tree on a balanced data set. For more information please refer to this paper.
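To make the sampling scheme concrete, here is a minimal sketch of the balanced bootstrap step described above (my own illustration using numpy; the function name is hypothetical, not part of any library):

import numpy as np

def balanced_bootstrap_indices(y, rng):
    # Draw, with replacement, n_min instances from each class, where
    # n_min is the size of the smallest class, so every tree sees a
    # balanced training set.
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=True)
        for c in classes
    ])
    rng.shuffle(idx)
    return idx

# Example: 100 instances of class 0 and 30 of class 1
rng = np.random.default_rng(0)
y = np.array([0] * 100 + [1] * 30)
idx = balanced_bootstrap_indices(y, rng)  # 30 + 30 balanced indices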

RandomForestClassifier() does have a class_weight= parameter, which can be set to 'balanced', but I'm not sure that this amounts to downsampling the bootstrapped training samples.
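For reference, class_weight='balanced' reweights samples inversely to class frequency rather than resampling them; a quick sketch on synthetic data (my own example, not the asker's setup):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Roughly 100 vs. 30 instances, mirroring the question's example
X, y = make_classification(n_samples=130, weights=[100/130, 30/130],
                           random_state=0)

# Each tree still sees an ordinary bootstrap of the full data set;
# 'balanced' only rescales sample weights inversely to class frequency.
# ('balanced_subsample' recomputes those weights per bootstrap sample,
# but it does not undersample either.)
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                             random_state=0)
clf.fit(X, y)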

asked Nov 12 '16 by Arnold Klein

People also ask

How does balanced random forest work?

Standard random forests are biased toward the majority class. To overcome this limitation, it is crucial to make the class priors equal, either by downsampling or oversampling. Balanced Random Forest (BRF) does this by iteratively drawing bootstrap samples with equal proportions of data points from the minority and the majority class.

Can random forest handle imbalanced data?

Random forest is very effective on a wide range of problems, but, like bagging, the performance of the standard algorithm is not great on imbalanced classification problems.

What is the Randomforestclassifier model in Sklearn?

A random forest classifier. A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.
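A minimal usage sketch on synthetic data (my own example):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))  # predicted class labels for the first 5 rows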


1 Answer

What you're looking for is the BalancedBaggingClassifier from imblearn.

imblearn.ensemble.BalancedBaggingClassifier(base_estimator=None,
 n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True,
 bootstrap_features=False, oob_score=False, warm_start=False, ratio='auto',
 replacement=False, n_jobs=1, random_state=None, verbose=0)

Effectively, what it allows you to do is successively undersample your majority class while fitting an estimator on top. You can use a random forest or any other base estimator from scikit-learn. Here is an example.
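(A minimal sketch assuming the older imbalanced-learn API shown in the signature above; newer releases rename ratio to sampling_strategy and base_estimator to estimator. The data are synthetic, not the asker's.)

from collections import Counter

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from imblearn.ensemble import BalancedBaggingClassifier

# Synthetic imbalanced data: roughly 100 vs. 30 instances
X, y = make_classification(n_samples=130, weights=[100/130, 30/130],
                           random_state=0)
print(Counter(y))

# Each bagging iteration undersamples the majority class down to the
# minority size before fitting the base tree, which is exactly the
# balanced-bootstrap idea from the question.
bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
                                n_estimators=10, ratio='auto',
                                replacement=False, random_state=0)
bbc.fit(X, y)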

answered Sep 28 '22 by mamafoku