I'm building a classifier using highly unbalanced data. The strategy I'm interested in testing is ensembling a model using 3 different resampled datasets. In other words, each dataset will have all the samples from the rare class, but only n samples of the abundant class (technique #4 mentioned in this article).
I want to fit 3 different VotingClassifiers, one on each resampled dataset, and then combine the results of the individual models using another VotingClassifier (or similar). I know that building a single voting classifier looks like this:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier

# First Model
rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = XGBClassifier()
voting_clf_1 = VotingClassifier(
    estimators=[
        ('rf', rnd_clf_1),
        ('xgb', xgb_clf_1),
    ],
    voting='soft'
)
# And I can fit it with the first dataset this way:
voting_clf_1.fit(X_train_1, y_train_1)
But how can I stack the three of them if they are fitted on different datasets? For example, if I had three fitted models (see code below), I could build a function that calls the .predict_proba() method on each of the models and then "manually" averages the individual probabilities.
But... is there a better way?
# Fitting the individual models... but how to combine the predictions?
voting_clf_1.fit(X_train_1, y_train_1)
voting_clf_2.fit(X_train_2, y_train_2)
voting_clf_3.fit(X_train_3, y_train_3)
Thanks!
Usually the #4 method shown in the article is implemented with the same type of classifier for every resample. It looks like you want to fit a VotingClassifier on each resampled dataset.
This methodology is already implemented in imblearn.ensemble.BalancedBaggingClassifier, which extends scikit-learn's bagging approach.
You can pass the VotingClassifier as the estimator and set the number of estimators to the number of times you want to resample the dataset. Use the sampling_strategy parameter to set the proportion of downsampling you want for the majority class.
Working Example:
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import xgboost as xgb
from imblearn.ensemble import BalancedBaggingClassifier

# Imbalanced toy data: roughly 10% of samples in class 0, 90% in class 1
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape %s' % Counter(y))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = xgb.XGBClassifier()
voting_clf_1 = VotingClassifier(
    estimators=[
        ('rf', rnd_clf_1),
        ('xgb', xgb_clf_1),
    ],
    voting='soft'
)

# Each bagged copy of the voting classifier is fitted on its own balanced resample.
# n_estimators (default 10) sets how many resamples are drawn.
# Note: recent imbalanced-learn releases name this argument `estimator`;
# older releases used `base_estimator`.
bbc = BalancedBaggingClassifier(estimator=voting_clf_1, random_state=42)
bbc.fit(X_train, y_train)
y_pred = bbc.predict(X_test)
print(confusion_matrix(y_test, y_pred))
See here. Maybe you can reuse the _predict_proba() and _collect_probas() functions after fitting your estimators manually.
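If you would rather stay with the manual route from the question (fit each voting classifier on its own resample, then average the probabilities), a minimal sketch might look like the one below. The helpers `undersample` and `average_proba` are hypothetical names made up for this illustration, not part of scikit-learn, and LogisticRegression stands in for XGBClassifier only to keep the example free of extra dependencies. The sketch assumes equal weights for the three models and that every model saw the same set of class labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression


def undersample(X, y, rare_label, n_abundant, rng):
    """Hypothetical helper: keep every rare-class row plus n_abundant random other rows."""
    rare_idx = np.flatnonzero(y == rare_label)
    abundant_idx = np.flatnonzero(y != rare_label)
    keep = rng.choice(abundant_idx, size=n_abundant, replace=False)
    idx = np.concatenate([rare_idx, keep])
    return X[idx], y[idx]


def average_proba(models, X):
    """Hypothetical helper: equal-weight average of predict_proba over fitted models."""
    classes = models[0].classes_
    # The probability columns only line up if every model has the same classes_.
    if not all(np.array_equal(m.classes_, classes) for m in models):
        raise ValueError("all models must share the same classes_")
    proba = np.mean([m.predict_proba(X) for m in models], axis=0)
    return classes, proba


# Toy imbalanced data: class 1 is the rare class (about 10% of the rows).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           flip_y=0, random_state=10)
rng = np.random.default_rng(0)

# Fit one soft-voting classifier per resampled dataset.
models = []
for seed in range(3):
    X_res, y_res = undersample(X, y, rare_label=1, n_abundant=100, rng=rng)
    clf = VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=50, random_state=seed)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",
    )
    models.append(clf.fit(X_res, y_res))

# Combine the three fitted models by averaging their class probabilities.
classes, proba = average_proba(models, X)
y_pred = classes[np.argmax(proba, axis=1)]
```

This is essentially soft voting done by hand across the three fitted models. In practice you would evaluate on held-out data rather than on X itself.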