Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to build voting classifier in sklearn when the individual classifiers are being fit with different datasets?

I'm building a classifier using highly unbalanced data. The strategy I'm interesting in testing is ensembling a model using 3 different resampled datasets. In other words, each dataset will have all the samples from the rare class, but only n samples of the abundant class (technique #4 mentioned in this article).

I want to fit 3 different VotingClassifiers on each resampled dataset, and then combine the results of the individual models using another VotingClassifier (or similar). I know that building a single voting classifier looks like this:

# First Model
rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = XGBClassifier()

voting_clf_1 = VotingClassifier(
    estimators = [
        ('rf', rnd_clf_1), 
        ('xgb', xgb_clf_1),
    ],
    voting='soft'
)

# And I can fit it with the first dataset this way:
voting_clf_1.fit(X_train_1, y_train_1)

But how to stack the three of them if they are fitted on different datasets? For example, if I had three fitted models (see code below), I could build a function that calls the .predict_proba() method on each of the models and then "manually" averages the individual probabilities.

But... is there a better way?

# Fitting the individual models... but how to combine the predictions?
voting_clf_1.fit(X_train_1, y_train_1)
voting_clf_2.fit(X_train_2, y_train_2)
voting_clf_3.fit(X_train_3, y_train_3)

Thanks!

like image 666
rv123 Avatar asked Nov 07 '22 21:11

rv123


1 Answers

Usually the #4 method shown in the article is implemented with same type of classifier. It looks like you want to try VotingClassifier on each sample dataset.

There is an implementation of this methodology already in imblearn.ensemble.BalancedBaggingClassifier, which is an extension from Sklearn Bagging approach.

You can feed the estimator as VotingClassifier and number of estimators as the number of times, which you want carry out the dataset sampling. Use sampling_strategy param to mention proportion of downsampling which you want on Majority class.

Working Example:

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from imblearn.ensemble import BalancedBaggingClassifier # doctest: +NORMALIZE_WHITESPACE
X, y = make_classification(n_classes=2, class_sep=2,
weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape %s' % Counter(y))

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)

rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = xgb.XGBClassifier()

voting_clf_1 = VotingClassifier(
    estimators = [
        ('rf', rnd_clf_1), 
        ('xgb', xgb_clf_1),
    ],
    voting='soft'
)

bbc = BalancedBaggingClassifier(base_estimator=voting_clf_1, random_state=42)
bbc.fit(X_train, y_train) # doctest: +ELLIPSIS

y_pred = bbc.predict(X_test)
print(confusion_matrix(y_test, y_pred))

See here. May be you can reuse _predict_proba() and _collect_probas() functions after fitting your estimators manually.

like image 73
Venkatachalam Avatar answered May 12 '23 17:05

Venkatachalam