scikit-learn: Random forest class_weight and sample_weight parameters

I have a class imbalance problem and have been experimenting with a weighted Random Forest using the implementation in scikit-learn (>= 0.16).

I have noticed that the implementation takes a class_weight parameter in the tree constructor and a sample_weight parameter in the fit method to help solve class imbalance. The two seem to be multiplied together, though, to decide a final weight.

I have trouble understanding the following:

  • In what stages of tree construction/training/prediction are those weights used? I have seen some papers on weighted trees, but I am not sure what scikit-learn implements.
  • What exactly is the difference between class_weight and sample_weight?
Asked Jun 12 '15 by user36047


1 Answer

RandomForests are built on Trees, which are very well documented. Check how Trees use the sample weighting:

  • User guide on decision trees - tells exactly what algorithm is used
  • Decision tree API - explains how sample_weight is used by trees (which for random forests, as you have determined, is the product of class_weight and sample_weight).
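Concretely, the weights enter the split criterion: class counts become weight sums, so impurity (e.g. Gini) is computed from weighted class fractions. Here is a minimal sketch of that idea (a hypothetical helper for illustration, not scikit-learn's actual implementation):

```python
import numpy as np

def weighted_gini(y, w):
    """Weighted Gini impurity: 1 - sum_k p_k^2, where p_k is the
    weighted fraction of samples belonging to class k."""
    total = w.sum()
    return 1.0 - sum((w[y == k].sum() / total) ** 2 for k in np.unique(y))

y = np.array([0, 0, 0, 1])

# Unweighted: p0 = 0.75, p1 = 0.25 -> 1 - (0.5625 + 0.0625) = 0.375
print(weighted_gini(y, np.ones(4)))

# Upweighting the minority sample 3x: p0 = p1 = 0.5 -> impurity 0.5,
# so the splitter now treats this node as maximally impure
print(weighted_gini(y, np.array([1.0, 1.0, 1.0, 3.0])))
```

Because the splitter minimizes this weighted impurity, upweighted samples effectively pull split thresholds toward separating them cleanly.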

As for the difference between class_weight and sample_weight: much can be determined simply from their datatypes. sample_weight is a 1D array of length n_samples, assigning an explicit weight to each example used for training. class_weight is either a dictionary mapping each class to a weight for that class (e.g., {1: .9, 2: .5, 3: .01}), or a string (such as 'balanced') telling sklearn how to determine this dictionary automatically.
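To make the datatypes concrete, here is a minimal sketch with toy data and arbitrary weights (not from the original answer), showing where each parameter goes:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = np.r_[np.zeros(90, dtype=int), np.ones(10, dtype=int)]  # 9:1 imbalance

# class_weight: per-class dict (or the string 'balanced'),
# passed to the constructor
clf = RandomForestClassifier(n_estimators=25,
                             class_weight={0: 1.0, 1: 9.0},
                             random_state=0)

# sample_weight: 1D array of length n_samples, passed to fit
w = np.where(y == 1, 2.0, 1.0)
clf.fit(X, y, sample_weight=w)
```

Under the hood each sample in class 1 ends up with effective weight 9.0 * 2.0 = 18.0, the product of its class weight and its sample weight.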

So the training weight for a given example is the product of its explicitly given sample_weight (or 1 if sample_weight is not provided) and its class's class_weight (or 1 if class_weight is not provided).
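As a sanity check on that product rule (a sketch with toy data, not part of the original answer): a dict class_weight and the equivalent explicit sample_weight should produce the same forest, since scikit-learn expands class_weight into per-sample weights and multiplies them in:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (rng.rand(200) < 0.2).astype(int)  # imbalanced toy labels

# Route 1: weight class 1 five-fold via class_weight
a = RandomForestClassifier(n_estimators=15, random_state=42,
                           class_weight={0: 1.0, 1: 5.0}).fit(X, y)

# Route 2: the same weighting expressed as an explicit sample_weight
b = RandomForestClassifier(n_estimators=15, random_state=42)
b.fit(X, y, sample_weight=np.where(y == 1, 5.0, 1.0))

# Same random_state + equal effective weights -> identical trees,
# so the two forests agree on every prediction
print(np.array_equal(a.predict(X), b.predict(X)))
```

If you supply both parameters, the per-sample weights from the two routes are simply multiplied before the trees are grown.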

Answered Oct 25 '22 by Andreus