I've read in the relevant documentation that:
Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value.
But it is still unclear to me how this works. If I set sample_weight with an array of only two possible values, 1's and 2's, does this mean that the samples with 2's will get sampled twice as often as the samples with 1's when doing the bagging? I cannot think of a practical example for this.
A decision tree is a type of supervised learning algorithm that can be used for both regression and classification problems. The algorithm uses training data to create rules that can be represented by a tree structure. Like any other tree representation, it has a root node, internal nodes, and leaf nodes.
Some quick preliminaries:
Let's say we have a classification problem with K classes. In a region of feature space represented by the node of a decision tree, recall that the "impurity" of the region is measured by quantifying the inhomogeneity, using the probability of the class in that region. Normally, we estimate:
Pr(Class=k) = #(examples of class k in region) / #(total examples in region)
The impurity measure takes as input the array of class probabilities:
[Pr(Class=1), Pr(Class=2), ..., Pr(Class=K)]
and spits out a number, which tells you how "impure" or how inhomogeneous-by-class the region of feature space is. For example, the gini measure for a two class problem is 2*p*(1-p), where p = Pr(Class=1) and 1-p = Pr(Class=2).
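To make the preliminaries concrete, here is a minimal sketch of an impurity function that takes the class-probability array and returns one number. The function name gini is my own choice for illustration, not something taken from scikit-learn. It uses the general K-class form 1 - sum(p_k^2), which reduces to 2*p*(1-p) when there are two classes.

```python
def gini(probs):
    """Gini impurity of a region, given its array of class probabilities."""
    return 1.0 - sum(p * p for p in probs)

# Two-class case: compare the general form against the 2*p*(1-p) shortcut.
p = 2.0 / 3.0
general_form = gini([p, 1.0 - p])
shortcut = 2 * p * (1 - p)
print(general_form, shortcut)

# A pure region has zero impurity; an even two-class split has the maximum, 0.5.
print(gini([1.0, 0.0]), gini([0.5, 0.5]))
```

The point is only that impurity depends on the data through the probability array, so anything that changes those estimates changes the impurity.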
Now, basically the short answer to your question is:
sample_weight augments the probability estimates in the probability array ... which augments the impurity measure ... which augments how nodes are split ... which augments how the tree is built ... which augments how feature space is diced up for classification.
I believe this is best illustrated through example.
First consider the following 2-class problem where the inputs are 1 dimensional:
    from sklearn.tree import DecisionTreeClassifier as DTC

    X = [[0], [1], [2]]  # 3 simple training examples
    Y = [1, 2, 1]        # class labels

    dtc = DTC(max_depth=1)
So, we'll look at trees with just a root node and two children. Note that the default impurity measure is the gini measure.
Without sample_weight
    dtc.fit(X, Y)
    print(dtc.tree_.threshold)  # [0.5, -2, -2]
    print(dtc.tree_.impurity)   # [0.44444444, 0, 0.5]
The first value in the threshold array tells us that the 1st training example is sent to the left child node, and the 2nd and 3rd training examples are sent to the right child node. The last two values in threshold are placeholders and are to be ignored. The impurity array tells us the computed impurity values in the parent, left, and right nodes respectively.
In the parent node, p = Pr(Class=1) = 2/3, so that gini = 2*(2.0/3.0)*(1.0/3.0) = 0.444... You can confirm the child node impurities as well.
With sample_weight
Now, let's try:
    dtc.fit(X, Y, sample_weight=[1, 2, 3])
    print(dtc.tree_.threshold)  # [1.5, -2, -2]
    print(dtc.tree_.impurity)   # [0.44444444, 0.44444444, 0.]
You can see the feature threshold is different. sample_weight also affects the impurity measure in each node. Specifically, in the probability estimates, the first training example is counted the same, the second is counted double, and the third is counted triple, due to the sample weights we've provided.
The impurity in the parent node region is the same. This is just a coincidence. We can compute it directly:
p = Pr(Class=1) = (1+3) / (1+2+3) = 2.0/3.0
The gini measure of 4/9 follows.
Now, you can see from the chosen threshold that the first and second training examples are sent to the left child node, while the third is sent to the right. The impurity in the left child node is also calculated to be 4/9 because:
p = Pr(Class=1) = 1 / (1+2) = 1/3.
The impurity of zero in the right child is due to only one training example lying in that region.
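If you'd like to check the arithmetic above without fitting a tree, here is a small pure-Python sketch. The helper names weighted_probs and gini are hypothetical, and the code is my reading of "each example counts as much as its weight", not scikit-learn's actual implementation.

```python
def weighted_probs(labels, weights):
    """Class probabilities where each example counts as much as its weight."""
    totals = {}
    for label, w in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + w
    total = sum(totals.values())
    return {label: t / total for label, t in totals.items()}

def gini(probs):
    """Gini impurity from a {class: probability} mapping."""
    return 1.0 - sum(p * p for p in probs.values())

Y = [1, 2, 1]   # class labels from the example
W = [1, 2, 3]   # the sample weights passed to fit

parent = weighted_probs(Y, W)          # all three examples
left = weighted_probs(Y[:2], W[:2])    # 1st and 2nd examples (x <= 1.5)
right = weighted_probs(Y[2:], W[2:])   # 3rd example (x > 1.5)

print("parent:", parent[1], gini(parent))  # p = (1+3)/(1+2+3) = 2/3, gini = 4/9
print("left:  ", left[1], gini(left))      # p = 1/(1+2) = 1/3, gini = 4/9
print("right: ", right[1], gini(right))    # p = 1, gini = 0
```

Setting W = [1, 1, 1] recovers the plain count-based estimates, which is one way to see that the weights act on the probability estimates and not on the data itself.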
You can extend this to non-integer sample weights similarly. I recommend trying something like sample_weight = [1,2,2.5], and confirming the computed impurities.
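For the non-integer case, here is a rough sketch of how you might work out the expected numbers by hand before comparing them with dtc.tree_.threshold and dtc.tree_.impurity. It assumes a CART-style rule: try the midpoints between sorted inputs and keep the first threshold with the lowest weight-averaged child impurity. best_stump and weighted_gini are hypothetical helpers of mine, not scikit-learn functions, and the real splitter may differ in details such as tie-breaking.

```python
from collections import defaultdict

def weighted_gini(labels, weights):
    """Gini impurity where each example counts as much as its weight."""
    totals = defaultdict(float)
    for label, w in zip(labels, weights):
        totals[label] += w
    total = sum(totals.values())
    return 1.0 - sum((t / total) ** 2 for t in totals.values())

def best_stump(xs, labels, weights):
    """Return (score, threshold, left_impurity, right_impurity) for the
    midpoint split with the lowest weight-averaged child impurity."""
    rows = sorted(zip(xs, labels, weights))
    best = None
    for i in range(1, len(rows)):
        if rows[i - 1][0] == rows[i][0]:
            continue  # no threshold fits between equal inputs
        threshold = (rows[i - 1][0] + rows[i][0]) / 2.0
        left, right = rows[:i], rows[i:]
        g_left = weighted_gini([r[1] for r in left], [r[2] for r in left])
        g_right = weighted_gini([r[1] for r in right], [r[2] for r in right])
        w_left = sum(r[2] for r in left)
        w_right = sum(r[2] for r in right)
        score = (w_left * g_left + w_right * g_right) / (w_left + w_right)
        if best is None or score < best[0]:  # strict: first best split wins ties
            best = (score, threshold, g_left, g_right)
    return best

xs = [0, 1, 2]     # the single feature from X = [[0], [1], [2]]
Y = [1, 2, 1]
W = [1, 2, 2.5]

score, threshold, g_left, g_right = best_stump(xs, Y, W)
print("parent impurity:", weighted_gini(Y, W))
print("threshold:", threshold)
print("left, right impurity:", g_left, g_right)
```

Worked by hand, the parent has p = (1+2.5)/(1+2+2.5) = 7/11, so its gini is 56/121 (about 0.463), and under the assumption above the split lands at 1.5 with child impurities 4/9 and 0.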