 

How to use sample_weight parameter for algorithms in sklearn

I have a very imbalanced dataset and I'm performing a classification task. I've tried several algorithms (Decision Trees, Naive Bayes, Logistic Regression), and for each of them I've come across a parameter called sample_weight in scikit-learn.

Assume my dataset has around 100k positive data points and 20k negative data points,
i.e. roughly 83% positive labels and 17% negative labels.

From the docs I assume this parameter is used to tackle such an issue by giving more weight to the class with fewer data points, i.e. the minority class in an imbalanced dataset.

class_weight : dict or ‘balanced’, default: None

Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

My question is: what should my ideal class_weight be for the above imbalanced dataset, such that I could avoid techniques like oversampling or undersampling?
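
For reference, here is a minimal sketch of how the parameter can be passed, assuming a scikit-learn LogisticRegression estimator and the label convention 1 = positive, 0 = negative (both are assumptions on my part):

    from sklearn.linear_model import LogisticRegression

    # Option 1: let scikit-learn derive weights inversely proportional to
    # class frequencies.
    clf = LogisticRegression(class_weight='balanced', max_iter=1000)

    # Option 2: pass an explicit {class_label: weight} mapping, e.g. up-weight
    # the minority (negative) class by roughly the 100k / 20k = 5x imbalance ratio.
    clf = LogisticRegression(class_weight={1: 1.0, 0: 5.0}, max_iter=1000)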



1 Answer

class_weight should be set to 'balanced' so that the model is trained as if the classes were balanced.
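
As a sketch of what 'balanced' actually does: scikit-learn weights each class inversely proportional to its frequency, and you can inspect the resulting weights with compute_class_weight (the 0/1 labels below are my assumption for the question's 20k negative / 100k positive split):

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    # Hypothetical label vector matching the question: 100k positives, 20k negatives.
    y = np.array([1] * 100_000 + [0] * 20_000)

    weights = compute_class_weight(class_weight='balanced',
                                   classes=np.array([0, 1]),
                                   y=y)
    # balanced weight = n_samples / (n_classes * n_samples_per_class)
    # class 0: 120000 / (2 * 20000)  = 3.0
    # class 1: 120000 / (2 * 100000) = 0.6
    print(dict(zip([0, 1], weights)))  # {0: 3.0, 1: 0.6}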

Class weights are equivalent to random oversampling. In my opinion, intelligent oversampling techniques such as SMOTE are a more effective approach than simply adding weights to samples during training.

However, oversampling techniques have an added computational cost, because the model needs to be trained on a larger dataset (due to the oversampling). Class weighting, on the other hand, adds no computational cost to the model. Unless I'm training a very computationally expensive model, I usually prefer SMOTE.
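
If you do want to try the oversampling route, here is a minimal sketch using SMOTE from the imbalanced-learn package (a separate package, not part of scikit-learn itself; the synthetic dataset is just a stand-in for the question's data):

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # Synthetic stand-in for the question's ~83% / ~17% class split.
    X, y = make_classification(n_samples=120_000, weights=[0.17, 0.83],
                               random_state=0)

    # SMOTE synthesises new minority-class samples until the classes are balanced,
    # so the resampled training set is larger than the original.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(X.shape, X_res.shape)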



