I want to understand how the max_samples value of a BaggingClassifier affects the number of samples used for each of the base estimators.
This is the GridSearch output:
GridSearchCV(cv=5, error_score='raise',
estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=1, spl... n_estimators=100, n_jobs=-1, oob_score=False,
random_state=1, verbose=2, warm_start=False),
fit_params={}, iid=True, n_jobs=-1,
param_grid={'max_features': [0.6, 0.8, 1.0], 'max_samples': [0.6, 0.8, 1.0]},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)
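For context, a grid search producing output like that could have been set up roughly as below. This is only a sketch reconstructed from the printed parameters; the names X_train, y_train and gs5 are assumptions, and the GridSearchCV import path depends on your scikit-learn version.

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in very old versions

# Bagged decision trees, matching the settings visible in the printed estimator
bag = BaggingClassifier(DecisionTreeClassifier(random_state=1),
                        n_estimators=100, n_jobs=-1, random_state=1, verbose=2)
param_grid = {'max_features': [0.6, 0.8, 1.0], 'max_samples': [0.6, 0.8, 1.0]}
gs5 = GridSearchCV(bag, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
gs5.fit(X_train, y_train)  # X_train, y_train are assumed to be the 891-row training set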
Here I am finding out what the best params were:
print gs5.best_score_, gs5.best_params_
0.828282828283 {'max_features': 0.6, 'max_samples': 1.0}
Now I am picking out the best grid search estimator and trying to see how many samples that specific Bagging classifier used for each of its 100 base decision tree estimators.
val = []
for i in np.arange(100):
    x = np.bincount(gs5.best_estimator_.estimators_samples_[i])[1]
    val.append(x)
print np.max(val)
print np.mean(val), np.std(val)
587
563.92 10.3399032877
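(Aside on that bincount trick: it counts True entries, which works when estimators_samples_ returns boolean masks, as in older scikit-learn releases. In newer releases estimators_samples_ returns arrays of drawn indices, so a version-tolerant way to count the unique samples per estimator might look like the following sketch, which still assumes the fitted gs5 from above.)

import numpy as np

val = []
for samples in gs5.best_estimator_.estimators_samples_:
    samples = np.asarray(samples)
    if samples.dtype == bool:            # older scikit-learn: boolean mask of drawn samples
        val.append(samples.sum())
    else:                                # newer scikit-learn: array of drawn indices
        val.append(len(np.unique(samples)))
print(np.max(val))
print(np.mean(val), np.std(val))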
Now, the size of the training set is 891. Since CV is 5, 891 * 0.8 = 712.8 samples should go into each Bagging classifier fit during cross-validation, and since max_samples is 1.0, 891 * 0.8 * 1.0 = 712.8 should be the number of samples per base estimator, or something close to it.
So why is the number in the range 564 +/- 10, with a maximum of 587, when by that calculation it should be close to 712? Thanks.
A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction.
max_samples: the number of samples to draw from X to train each base estimator. max_features: the number of features to draw from X to train each base estimator.
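To see how those two fractions translate into per-estimator subset sizes, here is a small sketch on made-up toy data (the dataset, sizes, and parameter values are chosen purely for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data: 200 samples, 10 features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# max_samples=0.8 -> each tree is trained on int(0.8 * 200) = 160 drawn samples
# max_features=0.6 -> each tree sees int(0.6 * 10) = 6 features
clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=5,
                        max_samples=0.8, max_features=0.6,
                        random_state=0).fit(X, y)

print(len(clf.estimators_features_[0]))   # 6 features per base estimator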
An important hyperparameter for the bagging algorithm is the number of decision trees used in the ensemble. Typically, the number of trees is increased until model performance stabilizes. Intuition might suggest that more trees will lead to overfitting, but this is not the case.
Bagging is used to deal with the bias-variance trade-off and reduces the variance of a prediction model. It helps avoid overfitting and is used for both regression and classification models, particularly with decision tree algorithms.
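To watch that stabilization happen, a sketch like the following sweeps n_estimators on toy data and reports cross-validated accuracy (the data and values here are assumptions, not from the question):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for n in (10, 50, 100, 200):
    clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=n, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(n, scores.mean())   # the mean score typically levels off as n grows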
After doing more research, I think I've figured out what's going on. GridSearchCV uses cross-validation on the training data to determine the best parameters, but the estimator it returns is refit on the entire training set, not on one of the CV folds. This makes sense because more training data is usually better.
So, the BaggingClassifier you get back from GridSearchCV is fit on the full dataset of 891 samples. It is true then that, with max_samples=1.0, each base estimator will randomly draw 891 samples from the training set. However, by default samples are drawn with replacement, so the number of unique samples will be less than the total number of samples because of duplicates. If you want to draw without replacement, set the bootstrap keyword of BaggingClassifier to False.
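A quick way to convince yourself of this is the sketch below, which compares bootstrap=True and bootstrap=False on toy data of the same size as the questioner's training set (the data itself is made up):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def n_unique(samples):
    samples = np.asarray(samples)
    # older scikit-learn returns boolean masks, newer versions return drawn indices
    return samples.sum() if samples.dtype == bool else len(np.unique(samples))

# 891 toy samples to mirror the size of the questioner's training set
X, y = make_classification(n_samples=891, n_features=10, random_state=0)

for bootstrap in (True, False):
    clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                            max_samples=1.0, bootstrap=bootstrap,
                            random_state=0).fit(X, y)
    print(bootstrap, np.mean([n_unique(s) for s in clf.estimators_samples_]))
# bootstrap=True  -> about 563 unique samples per estimator (duplicates from resampling)
# bootstrap=False -> exactly 891 unique samples per estimator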
Now, exactly how close should we expect the number of distinct samples to be to the size of the dataset when drawing with replacement?
Based on this question, the expected number of distinct samples when drawing n samples with replacement from a set of n samples is n * (1 - ((n-1)/n)^n). When we plug 891 into this, we get
>>> 891 * (1.- (890./891)**891)
563.4034437025824
The expected number of distinct samples (563.4) is very close to your observed mean (563.92), so it appears that nothing abnormal is going on.
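If you want to double-check that closed form against a direct simulation, a small sketch like this (not part of the original answer) gives essentially the same number:

import numpy as np

n = 891
rng = np.random.RandomState(0)
# draw n indices with replacement, count the distinct ones, repeat many times
draws = [len(np.unique(rng.randint(0, n, size=n))) for _ in range(1000)]
print(np.mean(draws))                      # about 563.4 in practice
print(n * (1.0 - ((n - 1.0) / n) ** n))    # 563.4034... from the closed form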