I am trying to solve a binary classification problem with a class imbalance. I have a dataset of 210,000 records in which 92% are 0s and 8% are 1s. I am using random forests in sklearn (v0.16) in Python.
I see there are two parameters, sample_weight and class_weight, when constructing the classifier. I am currently using the parameter class_weight="auto".
Am I using this correctly? What do class_weight and sample_weight actually do, and which should I be using?
Class weights are what you should be using.
Sample weights allow you to specify a multiplier for the impact a particular sample has. Weighting a sample with a weight of 2.0 has roughly the same effect as if the point were present twice in the data (although the exact effect is estimator-dependent).
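To make the duplication analogy concrete, here is a minimal sketch on a toy dataset (the data is made up for illustration). For a single decision tree, a sample weight of 2.0 and a physical duplicate produce identical weighted counts at each split, so the fitted trees agree:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy 1-D dataset: two samples of each class.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Weight the last sample by 2.0 via sample_weight on fit() ...
clf_weighted = DecisionTreeClassifier(random_state=0)
clf_weighted.fit(X, y, sample_weight=np.array([1.0, 1.0, 1.0, 2.0]))

# ... versus physically duplicating that sample in the training data.
clf_duplicated = DecisionTreeClassifier(random_state=0)
clf_duplicated.fit(np.vstack([X, [[3.0]]]), np.append(y, 1))

# On this toy data both trees should make the same predictions.
print(clf_weighted.predict(X))
print(clf_duplicated.predict(X))
```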
Class weights have the same effect, but are used to apply a fixed multiplier to every sample that falls into the specified class. In terms of functionality you could use either, but class_weight is provided for convenience so you do not have to weight each sample manually. It is also possible to combine the two, in which case the class weights are multiplied by the sample weights.
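A short sketch of both spellings on synthetic imbalanced data (the dataset and the 11.5 multiplier, roughly 92/8, are illustrative; in sklearn 0.16 the automatic option was spelled "auto", while newer versions spell it "balanced"):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data: ~8% positives, mirroring the question.
rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (rng.rand(1000) < 0.08).astype(int)

# Explicit per-class multipliers: every class-1 sample counts ~11.5x.
clf = RandomForestClassifier(
    n_estimators=50, class_weight={0: 1.0, 1: 11.5}, random_state=0
)
clf.fit(X, y)

# Or let sklearn derive the weights from class frequencies
# ("auto" in v0.16, "balanced" in later releases).
clf_auto = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
)
clf_auto.fit(X, y)

# The two mechanisms combine: each sample's effective weight is
# its sample_weight multiplied by its class's weight.
sw = rng.rand(1000) + 0.5
clf_auto.fit(X, y, sample_weight=sw)
```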
One of the main uses for sample_weight on the fit() method is to allow boosting meta-algorithms like AdaBoostClassifier to operate on existing decision tree classifiers, increasing or decreasing the weights of individual samples as the algorithm requires.
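A sketch of that boosting use case (the dataset is synthetic; AdaBoost internally calls the base tree's fit() with updated sample_weight after each round):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data, roughly matching the question's 92/8 split.
X, y = make_classification(n_samples=500, weights=[0.92, 0.08], random_state=0)

# Shallow trees are the usual base estimator; after each boosting round
# AdaBoost re-fits them with adjusted per-sample weights via
# fit(X, y, sample_weight=...).
boost = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    random_state=0,
)
boost.fit(X, y)
print(boost.score(X, y))
```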