 

How does sample_weight compare to class_weight in scikit-learn?

I would like to use sklearn.ensemble.GradientBoostingClassifier on an imbalanced classification problem. I intend to optimize for Area Under the Receiver Operating Characteristic Curve (ROC AUC). For this I would like to reweight my classes to make the small class more important to the classifier.

This would normally be done (in RandomForestClassifier, for example) by setting class_weight="balanced", but there is no such parameter in GradientBoostingClassifier.

The documentation says:

The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

If y_train is my pandas Series of targets with elements in {0, 1}, then the documentation implies that this should reproduce the same behavior as class_weight="balanced":

import numpy as np
from sklearn import ensemble

sample_weight = y_train.shape[0] / (2 * np.bincount(y_train))
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train, sample_weight=sample_weight[y_train.values])

Is this correct or am I missing something?
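As a quick sanity check (using a toy label vector assumed here for illustration), the manual per-class weights can be compared against scikit-learn's compute_sample_weight helper:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y_train = np.array([0, 0, 1, 1, 1, 1, 1])  # toy imbalanced labels (assumed)

# manual per-class weights: n_samples / (n_classes * bincount)
per_class = y_train.shape[0] / (2 * np.bincount(y_train))
manual = per_class[y_train]  # expand to one weight per sample

# scikit-learn's equivalent of class_weight="balanced"
auto = compute_sample_weight(class_weight="balanced", y=y_train)

print(np.allclose(manual, auto))  # True: the two approaches agree
```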

asked Nov 20 '17 by Keith

People also ask

What is sample_weight in scikit-learn?

sample_weight adjusts the probability estimates in the probability array ... which affects the impurity measure ... which affects how nodes are split ... which affects how the tree is built ... which affects how feature space is diced up for classification.
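This chain of effects can be seen in a minimal sketch (with a deliberately tiny, assumed dataset): two identical points with conflicting labels end up in one leaf, and sample_weight decides which class dominates that leaf's probability estimate.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# two identical inputs with conflicting labels (toy data, assumed)
X = np.array([[1.0], [1.0]])
y = np.array([0, 1])

# weight the class-1 sample 10x when building the tree
clf = DecisionTreeClassifier().fit(X, y, sample_weight=[1.0, 10.0])

# the single leaf's class probabilities now reflect the weights
proba = clf.predict_proba([[1.0]])[0]
print(proba)  # class 1 dominates with probability 10/11
```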

What is class_weight='balanced'?

With class_weight="balanced" you capture more true events (higher recall on the positive class), but you are also more likely to raise false alerts (lower precision on the positive class). As a result, the total percentage of predicted positives may be higher than the actual rate because of all the false positives.

What is class_weight in logistic regression?

The LogisticRegression class provides the class_weight argument that can be specified as a model hyperparameter. The class_weight is a dictionary that defines each class label (e.g. 0 and 1) and the weighting to apply in the calculation of the negative log likelihood when fitting the model.
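A minimal sketch (on a synthetic imbalanced dataset, assumed here for illustration) of passing such a dictionary, alongside the automatic "balanced" setting:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy imbalanced data: roughly 90% class 0, 10% class 1 (assumed)
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# count each minority-class sample 9x in the negative log-likelihood
clf_dict = LogisticRegression(class_weight={0: 1, 1: 9}).fit(X, y)

# the equivalent automatic setting
clf_auto = LogisticRegression(class_weight="balanced").fit(X, y)

# an unweighted baseline for comparison
clf_plain = LogisticRegression().fit(X, y)
```

Upweighting the minority class typically shifts the decision boundary so the model predicts the positive class more often than the unweighted baseline.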

What is sample_weight in Keras?

sample_weight: Optional array of the same length as x, containing weights to apply to the model's loss for each sample. In the case of temporal data, you can pass a 2D array with shape (samples, sequence_length), to apply a different weight to every timestep of every sample.
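Framework aside, the per-timestep weighting idea reduces to multiplying each element of the loss by its weight before averaging. A plain-NumPy sketch (with made-up error and weight values, assumed for illustration):

```python
import numpy as np

# per-sample, per-timestep losses: shape (samples, sequence_length)
errors = np.array([[1.0, 4.0],
                   [9.0, 16.0]])

# 2D sample_weight: a different weight for every timestep of every sample
w = np.array([[1.0, 0.5],
              [0.0, 1.0]])  # a zero weight drops that timestep entirely

# weighted mean loss
weighted_loss = (errors * w).sum() / w.sum()
print(weighted_loss)  # (1 + 2 + 0 + 16) / 2.5 = 7.6
```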


1 Answer

I would suggest you use the compute_sample_weight utility from sklearn.utils.class_weight. For example:

from sklearn.utils.class_weight import compute_sample_weight
y = [1,1,1,1,0,0,1]
compute_sample_weight(class_weight='balanced', y=y)

Output:

array([ 0.7 ,  0.7 ,  0.7 ,  0.7 ,  1.75,  1.75,  0.7 ])

You can use this as input to the sample_weight keyword.
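Putting it together for the question's setting, a minimal end-to-end sketch (the synthetic imbalanced dataset and hyperparameters here are assumptions for illustration):

```python
import numpy as np
from sklearn import ensemble
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# synthetic imbalanced problem (assumed): ~90% class 0, ~10% class 1
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# per-sample weights equivalent to class_weight="balanced"
sw = compute_sample_weight(class_weight="balanced", y=y_train)

clf = ensemble.GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train, sample_weight=sw)

# score with the metric the question optimizes for
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(auc)
```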

answered Oct 24 '22 by KPLauritzen