 

How does the 'max_samples' keyword for a Bagging classifier affect the number of samples being used for each of the base estimators?

I want to understand how the max_samples value for a Bagging classifier affects the number of samples being used for each of the base estimators.

This is the GridSearch output:

GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=1, spl... n_estimators=100, n_jobs=-1, oob_score=False,
         random_state=1, verbose=2, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_features': [0.6, 0.8, 1.0], 'max_samples': [0.6, 0.8, 1.0]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)

Here I am finding out what the best params were:

print gs5.best_score_, gs5.best_params_
0.828282828283 {'max_features': 0.6, 'max_samples': 1.0}

Now I am picking out the best estimator from the grid search and checking how many samples that specific Bagging classifier used in each of its 100 base decision tree estimators.

import numpy as np

val = []
for i in np.arange(100):
    # estimators_samples_[i] is a boolean mask in this scikit-learn version;
    # bincount(...)[1] counts the True entries, i.e. the unique samples drawn
    x = np.bincount(gs5.best_estimator_.estimators_samples_[i])[1]
    val.append(x)
print np.max(val)
print np.mean(val), np.std(val)

587
563.92 10.3399032877

Now, the size of the training set is 891. Since cv is 5, 891 * 0.8 = 712.8 samples should go into each Bagging classifier fit, and since max_samples is 1.0, 891 * 0.8 * 1.0 = 712.8 should be the number of samples per base estimator, or something close to it?

So, why is the number in the range 564 +/- 10, with a maximum value of 587, when by this calculation it should be close to 712? Thanks.

asked Aug 04 '16 by hkhare



1 Answer

After doing more research, I think I've figured out what's going on. GridSearchCV uses cross-validation on the training data to determine the best parameters, but the estimator it returns is fit on the entire training set, not on one of the CV folds. This makes sense because more training data is usually better.
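As a sanity check, here is a minimal sketch of that refit behavior, using synthetic data and a hypothetical one-point grid (it assumes a recent scikit-learn, where estimators_samples_ holds arrays of drawn indices rather than boolean masks):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 891-row training set.
X, y = make_classification(n_samples=891, random_state=1)

gs = GridSearchCV(
    BaggingClassifier(n_estimators=10, random_state=1),
    param_grid={'max_samples': [1.0]},
    cv=5,
    refit=True,  # the default: refit the best params on ALL of X
)
gs.fit(X, y)

# Each base estimator draws 891 bootstrap indices, i.e. from the full
# 891-row dataset, not from a ~713-row CV fold.
print(len(gs.best_estimator_.estimators_samples_[0]))  # 891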

So, the BaggingClassifier you get back from GridSearchCV is fit to the full dataset of 891 samples. It's true, then, that with max_samples=1.0 each base estimator will randomly draw 891 samples from the training set. However, by default the samples are drawn with replacement, so the number of unique samples will be less than the number of draws because of duplicates. If you want to draw without replacement, set the bootstrap keyword of BaggingClassifier to False, as the sketch below illustrates.
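A minimal sketch contrasting the two modes, on synthetic data (again assuming a recent scikit-learn where estimators_samples_ holds index arrays):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=891, random_state=1)

for bootstrap in (True, False):
    bag = BaggingClassifier(n_estimators=10, max_samples=1.0,
                            bootstrap=bootstrap, random_state=1).fit(X, y)
    # Count the distinct rows each base estimator actually saw.
    uniques = [len(np.unique(idx)) for idx in bag.estimators_samples_]
    print(bootstrap, np.mean(uniques))  # True: ~563 unique rows; False: 891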

Now, exactly how close should we expect the number of distinct samples to be to the size of the dataset when drawing with replacement?

Based on this question, the expected number of distinct samples when drawing n samples with replacement from a set of n samples is n * (1 - ((n-1)/n)^n), which tends to (1 - 1/e) * n, about 0.632 * n, for large n. When we plug 891 into this, we get

>>> 891 * (1.- (890./891)**891)
563.4034437025824
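A quick Monte Carlo check (not from the original answer, just a simulation of the same sampling process) agrees with this expectation:

import numpy as np

rng = np.random.default_rng(0)
n = 891
# Draw n indices with replacement, count the distinct ones, repeat.
distinct = [len(np.unique(rng.integers(0, n, size=n))) for _ in range(2000)]
print(np.mean(distinct))  # ~563.4, matching n * (1 - ((n-1)/n)**n)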

The expected number of distinct samples (563.4) is very close to your observed mean (563.92), so it appears that nothing abnormal is going on.

answered by bpachev