 

What is the difference between sample weight and class weight options in scikit learn?

I have class imbalance problem and want to solve this using cost sensitive learning.

  1. under sample and over sample
  2. give weights to class to use a modified loss function

Question

Scikit-learn has two options called class_weight and sample_weight. Is sample_weight actually doing option 2) and class_weight option 1)? Is option 2) the recommended way of handling class imbalance?

Asked by WonderWomen on Sep 10 '15


2 Answers

The concepts are similar, but with sample_weight you can force the estimator to pay more attention to some samples, while with class_weight you can force it to pay more attention to a particular class. Setting sample_weight=0 or class_weight=0 means the estimator ignores those samples/classes entirely during learning; a classifier, for example, will never predict a class whose class_weight is 0. If one sample_weight/class_weight is larger than the others, the estimator will prioritize minimizing the error on those samples/classes. You can use user-defined sample_weight and class_weight simultaneously.

Undersampling or oversampling your training set by simply removing or cloning samples is equivalent to decreasing or increasing the corresponding sample_weight/class_weight.
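A quick check of that equivalence (a sketch assuming scikit-learn; a small tolerance is used because the solver is iterative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(50, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Model A: train with the first sample physically duplicated.
X_dup = np.vstack([X, X[:1]])
y_dup = np.concatenate([y, y[:1]])
a = LogisticRegression(tol=1e-8, max_iter=1000).fit(X_dup, y_dup)

# Model B: same data, but give the first sample weight 2 instead.
w = np.ones(len(y))
w[0] = 2.0
b = LogisticRegression(tol=1e-8, max_iter=1000).fit(X, y, sample_weight=w)

# Both fits minimize the same objective, so the coefficients coincide.
print(np.allclose(a.coef_, b.coef_, atol=1e-4))
```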

In more complex cases you can also try generating samples artificially, with techniques like SMOTE.
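A toy SMOTE-style sketch in plain NumPy, to show the idea of interpolating between minority samples (the real implementation lives in the third-party imbalanced-learn package as `imblearn.over_sampling.SMOTE`; the function below is a hypothetical simplification):

```python
import numpy as np

def smote_like(X_minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a random
    minority sample and one of its k nearest minority neighbors."""
    rng = rng or np.random.RandomState(0)
    new = []
    for _ in range(n_new):
        i = rng.randint(len(X_minority))
        # distances from sample i to every minority sample
        d = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]      # skip the sample itself
        j = rng.choice(neighbors)
        gap = rng.rand()                        # random point on the segment
        new.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(new)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_min, n_new=5)
```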

Answered by Ibraim Ganiev on Oct 16 '22


sample_weight and class_weight have a similar function: both make your estimator pay more attention to some samples.

The actual per-sample weights will be sample_weight * the weights derived from class_weight.
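A sketch verifying that claim (assumes scikit-learn): fitting with class_weight and sample_weight together should match fitting with their product passed as a single sample_weight.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(60, 2)
y = (X[:, 0] > 0.5).astype(int)   # roughly imbalanced labels
w = rng.rand(60) + 0.5            # arbitrary per-sample weights
cw = {0: 1.0, 1: 3.0}

# Model A: class_weight and sample_weight used together.
a = LogisticRegression(class_weight=cw, tol=1e-8, max_iter=1000)
a.fit(X, y, sample_weight=w)

# Model B: the product passed as a single sample_weight.
combined = w * np.array([cw[label] for label in y])
b = LogisticRegression(tol=1e-8, max_iter=1000)
b.fit(X, y, sample_weight=combined)

print(np.allclose(a.coef_, b.coef_, atol=1e-4))
```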

This serves the same purpose as under/oversampling but the behavior is likely to be different: say you have an algorithm that randomly picks samples (like in random forests), it matters whether you oversampled or not.

To sum it up:
class_weight and sample_weight both do 2); option 2) is one way to handle class imbalance. I don't know of a universally recommended way; I would try 1), 2), and 1) + 2) on your specific problem to see what works best.
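A quick way to try option 2) is class_weight="balanced", which weights classes inversely to their frequency (a sketch assuming scikit-learn, on a synthetic imbalanced dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Reweighting the rare class makes the model predict it more often,
# trading precision for recall on the minority class.
print(plain.predict(X).sum(), balanced.predict(X).sum())
```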

Answered by ldirer on Oct 16 '22