I have a logistic regression and a random forest and I'd like to combine them (ensemble) for the final classification probability calculation by taking an average.
Is there a built-in way to do this in scikit-learn? Some way where I can use the ensemble of the two as a classifier itself? Or would I need to roll my own classifier?
The simplest way of combining classifier outputs is to let each classifier make its own prediction and then take the plurality prediction as the final output. This voting scheme is easy to implement and easy to understand, but it does not always produce the best possible results.
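A minimal sketch of that plurality vote, done by hand with NumPy (the three base estimators and the toy data are just placeholders):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

clfs = [LogisticRegression(max_iter=1000),
        RandomForestClassifier(random_state=0),
        DecisionTreeClassifier(random_state=0)]
for clf in clfs:
    clf.fit(X, y)

# Stack each classifier's hard predictions: shape (n_classifiers, n_samples).
votes = np.stack([clf.predict(X) for clf in clfs])

# For each sample, the most frequent label across classifiers wins.
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print(majority[:10])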
A simple way to achieve this is to split your training set in half. Use the first half of your training data to train the level-one classifiers. Then use the trained level-one classifiers to make predictions on the second half of the training data. These predictions should then be used to train the meta-classifier.
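A rough sketch of that manual procedure, assuming a binary problem (the 50/50 split and the choice of estimators are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)

# Split the training set in half.
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=0)

# Train the level-one classifiers on the first half.
level_one = [LogisticRegression(max_iter=1000),
             RandomForestClassifier(random_state=0)]
for clf in level_one:
    clf.fit(X_a, y_a)

# Their predicted probabilities on the second half become the
# features used to train the meta-classifier.
meta_features = np.column_stack([clf.predict_proba(X_b)[:, 1] for clf in level_one])
meta = LogisticRegression()
meta.fit(meta_features, y_b)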
An extra-trees classifier. This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
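For reference, it fits like any other scikit-learn estimator (hyperparameters below are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # mean accuracy on the data it was given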
Stacking refers to a method to blend estimators. In this strategy, some estimators are individually fitted on some training data while a final estimator is trained using the stacked predictions of these base estimators.
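scikit-learn exposes this as StackingClassifier (available since version 0.22). A minimal sketch, with the base estimators matching the question's logistic regression + random forest setup:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
print(stack.predict_proba(X[:5]))

Unlike the manual half-and-half split described above, StackingClassifier uses internal cross-validation to produce the out-of-fold predictions that train the final estimator.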
The key to understanding how to fine-tune classifiers in scikit-learn is to understand the methods .predict_proba() and .decision_function(). The former returns the predicted probability that a sample belongs to each class, while the latter returns a raw, unnormalized confidence score. Both are an important distinction from the hard class labels returned by calling the .predict() method.
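A short example of the difference (logistic regression supports all three methods; the data is a toy set):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.predict(X[:3]))            # hard class labels, e.g. [0 1 0]
print(clf.predict_proba(X[:3]))      # per-class probabilities, each row sums to 1
print(clf.decision_function(X[:3]))  # raw signed scores, not probabilities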
A comparison of several classifiers in scikit-learn on synthetic datasets. The point of this example is to illustrate the nature of decision boundaries of different classifiers. This should be taken with a grain of salt, as the intuition conveyed by these examples does not necessarily carry over to real datasets.
Scikit-learn provides easy access to numerous classification algorithms, among them K-Nearest Neighbors, Support Vector Machines, Decision Trees/Random Forests, Naive Bayes, Linear Discriminant Analysis, and Logistic Regression.
After the classifier model has been trained on the training data, it can make predictions on the testing data. This is easily done by calling predict on the classifier and passing it the features of your testing dataset:
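(A minimal self-contained sketch; the split, the estimator, and the toy data are placeholders.)

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predictions = clf.predict(X_test)  # one predicted label per row of X_test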
NOTE: The scikit-learn VotingClassifier is probably the best way to do this now
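With voting="soft", VotingClassifier averages the predicted probabilities of the base estimators, which is exactly what the question asks for. A minimal sketch (toy data; estimator settings are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

# voting="soft" averages predict_proba across the base estimators.
eclf = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=0))],
    voting="soft",
)
eclf.fit(X, y)
print(eclf.predict_proba(X[:5]))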
OLD ANSWER:
For what it's worth, I ended up doing this as follows:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class EnsembleClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, classifiers=None):
        self.classifiers = classifiers

    def fit(self, X, y):
        for classifier in self.classifiers:
            classifier.fit(X, y)
        return self  # return self so the estimator chains like other scikit-learn estimators

    def predict_proba(self, X):
        # Average the predicted class probabilities of all fitted classifiers.
        self.predictions_ = [classifier.predict_proba(X) for classifier in self.classifiers]
        return np.mean(self.predictions_, axis=0)
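Used like any other estimator; here the base estimators mirror the question's logistic regression + random forest setup (toy data for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
eclf = EnsembleClassifier([LogisticRegression(max_iter=1000),
                           RandomForestClassifier(random_state=0)])
eclf.fit(X, y)
print(eclf.predict_proba(X[:5]))  # averaged probabilities from both models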