I want to understand how the max_samples value of a BaggingClassifier affects the number of samples used for each of the base estimators.
This is the GridSearch output:
GridSearchCV(cv=5, error_score='raise',
estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=1, spl... n_estimators=100, n_jobs=-1, oob_score=False,
random_state=1, verbose=2, warm_start=False),
fit_params={}, iid=True, n_jobs=-1,
param_grid={'max_features': [0.6, 0.8, 1.0], 'max_samples': [0.6, 0.8, 1.0]},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)
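For context, a grid search producing output like that could have been set up roughly as below. This is only a sketch reconstructed from the printed parameters; the names X_train, y_train and gs5 are assumptions, and the GridSearchCV import path depends on your scikit-learn version.

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in very old versions

# Bagged decision trees, matching the settings visible in the printed estimator
bag = BaggingClassifier(DecisionTreeClassifier(random_state=1),
                        n_estimators=100, n_jobs=-1, random_state=1, verbose=2)
param_grid = {'max_features': [0.6, 0.8, 1.0], 'max_samples': [0.6, 0.8, 1.0]}
gs5 = GridSearchCV(bag, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
gs5.fit(X_train, y_train)  # X_train, y_train are assumed to be the 891-row training set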
Here I am finding out what the best params were:
print gs5.best_score_, gs5.best_params_
0.828282828283 {'max_features': 0.6, 'max_samples': 1.0}
Now I am picking out the best grid search estimator and trying to see how many samples that specific Bagging classifier used for each of its 100 base decision tree estimators.
val = []
for i in np.arange(100):
    x = np.bincount(gs5.best_estimator_.estimators_samples_[i])[1]
    val.append(x)
print np.max(val)
print np.mean(val), np.std(val)
587
563.92 10.3399032877
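(Aside on that bincount trick: it counts True entries, which works when estimators_samples_ returns boolean masks, as in older scikit-learn releases. In newer releases estimators_samples_ returns arrays of drawn indices, so a version-tolerant way to count the unique samples per estimator might look like the following sketch, which still assumes the fitted gs5 from above.)

import numpy as np

val = []
for samples in gs5.best_estimator_.estimators_samples_:
    samples = np.asarray(samples)
    if samples.dtype == bool:            # older scikit-learn: boolean mask of drawn samples
        val.append(samples.sum())
    else:                                # newer scikit-learn: array of drawn indices
        val.append(len(np.unique(samples)))
print(np.max(val))
print(np.mean(val), np.std(val))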
Now, the size of the training set is 891. Since CV is 5, 891 * 0.8 = 712.8 samples should go into each Bagging classifier fit during cross-validation, and since max_samples is 1.0, 891 * 0.8 * 1.0 = 712.8 should be the number of samples per base estimator, or something close to it.
So why is the number in the range 564 +/- 10, with a maximum of 587, when by that calculation it should be close to 712? Thanks.
A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction.
max_samples: the number of samples to draw from X to train each base estimator. max_features: the number of features to draw from X to train each base estimator.
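To see how those two fractions translate into per-estimator subset sizes, here is a small sketch on made-up toy data (the dataset, sizes, and parameter values are chosen purely for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data: 200 samples, 10 features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# max_samples=0.8 -> each tree is trained on int(0.8 * 200) = 160 drawn samples
# max_features=0.6 -> each tree sees int(0.6 * 10) = 6 features
clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=5,
                        max_samples=0.8, max_features=0.6,
                        random_state=0).fit(X, y)

print(len(clf.estimators_features_[0]))   # 6 features per base estimator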
An important hyperparameter for the bagging algorithm is the number of decision trees used in the ensemble. Typically, the number of trees is increased until model performance stabilizes. Intuition might suggest that more trees will lead to overfitting, but this is not the case.
Bagging is used to deal with the bias-variance trade-off and reduces the variance of a prediction model. It helps avoid overfitting and is used for both regression and classification models, particularly with decision tree algorithms.
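To watch that stabilization happen, a sketch like the following sweeps n_estimators on toy data and reports cross-validated accuracy (the data and values here are assumptions, not from the question):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for n in (10, 50, 100, 200):
    clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=n, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(n, scores.mean())   # the mean score typically levels off as n grows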
After doing more research, I think I've figured out what's going on. GridSearchCV uses cross-validation on the training data to determine the best parameters, but the estimator it returns is refit on the entire training set, not on one of the CV folds. This makes sense because more training data is usually better.
So, the BaggingClassifier you get back from GridSearchCV is fit on the full dataset of 891 samples. It is true then that, with max_samples=1.0, each base estimator will randomly draw 891 samples from the training set. However, by default samples are drawn with replacement, so the number of unique samples will be less than the total number of samples because of duplicates. If you want to draw without replacement, set the bootstrap keyword of BaggingClassifier to False.
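A quick way to convince yourself of this is the sketch below, which compares bootstrap=True and bootstrap=False on toy data of the same size as the questioner's training set (the data itself is made up):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def n_unique(samples):
    samples = np.asarray(samples)
    # older scikit-learn returns boolean masks, newer versions return drawn indices
    return samples.sum() if samples.dtype == bool else len(np.unique(samples))

# 891 toy samples to mirror the size of the questioner's training set
X, y = make_classification(n_samples=891, n_features=10, random_state=0)

for bootstrap in (True, False):
    clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                            max_samples=1.0, bootstrap=bootstrap,
                            random_state=0).fit(X, y)
    print(bootstrap, np.mean([n_unique(s) for s in clf.estimators_samples_]))
# bootstrap=True  -> about 563 unique samples per estimator (duplicates from resampling)
# bootstrap=False -> exactly 891 unique samples per estimator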
Now, exactly how close should we expect the number of distinct samples to be to the size of the dataset when drawing with replacement?
Based on this question, the expected number of distinct samples when drawing n samples with replacement from a set of n samples is n * (1 - ((n-1)/n)^n). When we plug 891 into this, we get
>>> 891 * (1.- (890./891)**891)
563.4034437025824
The expected number of distinct samples (563.4) is very close to your observed mean (563.92), so it appears that nothing abnormal is going on.
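If you want to double-check that closed form against a direct simulation, a small sketch like this (not part of the original answer) gives essentially the same number:

import numpy as np

n = 891
rng = np.random.RandomState(0)
# draw n indices with replacement, count the distinct ones, repeat many times
draws = [len(np.unique(rng.randint(0, n, size=n))) for _ in range(1000)]
print(np.mean(draws))                      # about 563.4 in practice
print(n * (1.0 - ((n - 1.0) / n) ** n))    # 563.4034... from the closed form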