In Keras (using TensorFlow as a backend) I am building a model that works with a huge dataset with highly imbalanced classes (labels). To be able to run the training process, I created a generator which feeds chunks of data to fit_generator.
According to the documentation for fit_generator, the output of the generator can be either the tuple (inputs, targets) or the tuple (inputs, targets, sample_weights). With that in mind, here are a few questions:
1. My understanding is that the class_weight regards the weights of all classes for the entire dataset whereas the sample_weights regards the weights of all classes for each individual chunk created by the generator. Is that correct? If not, can someone elaborate on the matter?
2. Is it necessary to give both the class_weight to the fit_generator and then the sample_weights as an output for each chunk? If yes, then why? If not then which one is better to give?
3. If I should give the sample_weights for each chunk, how do I map the weights if some of the classes are missing from a specific chunk? Let me give an example. In my overall dataset, I have 7 possible classes (labels). Because these classes are highly imbalanced, when I create smaller chunks of data as an output from the fit_generator, some of the classes are missing from the specific chunk. How should I create the sample_weights for these chunks?
sample_weights is used to provide a weight for each training sample. That means that you should pass a 1D array with the same number of elements as your training samples (indicating the weight for each of those samples). class_weights is used to provide a weight or bias for each output class.
Generating class weights
In binary classification, class weights can be obtained by calculating the frequency of the positive and negative classes and inverting them, so that when multiplied by the class loss, the underrepresented class contributes a much higher error than the majority class.
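A minimal sketch of the inverse-frequency idea described above, in plain Python. The function name inverse_frequency_class_weights and the toy label list are made up for illustration, and the n_samples / (n_classes * count) scaling is just one common choice, not the only one. The resulting dictionary has the {label: weight} shape that a class_weight argument typically expects.

```python
from collections import Counter


def inverse_frequency_class_weights(labels):
    """Return {class_label: weight} with weight = n_samples / (n_classes * count).

    Hypothetical helper for illustration: rare classes get large weights,
    frequent classes get small ones.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy, highly imbalanced binary labels: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
class_weight = inverse_frequency_class_weights(labels)
print(class_weight)
```

With this scaling the minority class ends up weighted nine times more heavily than the majority class, which matches the 90:10 imbalance in the toy data.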
The LogisticRegression class in scikit-learn provides the class_weight argument, which can be specified as a model hyperparameter. The class_weight is a dictionary that defines each class label (e.g. 0 and 1) and the weighting to apply in the calculation of the negative log likelihood when fitting the model.
sample_weight is defined on a per-sample basis and is independent of the class. class_weight is useful when training on highly skewed data sets, for example, a classifier to detect fraudulent transactions. sample_weight is useful when you don't have equal confidence in the samples in your batch.
My understanding is that the class_weight regards the weights of all classes for the entire dataset whereas the sample_weights regards the weights of all classes for each individual chunk created by the generator. Is that correct? If not, can someone elaborate on the matter?
class_weight affects the relative weight of each class in the calculation of the objective function. sample_weights, as the name suggests, allows further control of the relative weight of samples that belong to the same class.
Is it necessary to give both the class_weight to the fit_generator and then the sample_weights as an output for each chunk? If yes, then why? If not then which one is better to give?
It depends on your application. Class weights are useful when training on highly skewed data sets; for example, a classifier to detect fraudulent transactions. Sample weights are useful when you don't have equal confidence in the samples in your batch. A common example is performing regression on measurements with variable uncertainty.
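To make the regression case concrete, here is a rough sketch that turns per-measurement uncertainties into sample weights. Using weights proportional to 1 / sigma**2 (inverse-variance weighting) is an assumption on my part, as are the function name and the toy sigma values; the rescaling to an average weight of 1 is only there to keep the overall loss scale comparable to the unweighted case.

```python
def inverse_variance_weights(sigmas):
    """Return weights proportional to 1 / sigma**2, rescaled to average 1.

    Hypothetical helper: `sigmas` holds one standard deviation per measurement.
    """
    raw = [1.0 / (s * s) for s in sigmas]
    mean_raw = sum(raw) / len(raw)
    return [r / mean_raw for r in raw]


# Toy uncertainties: the first measurement is the most trustworthy.
sigmas = [0.5, 1.0, 2.0]
weights = inverse_variance_weights(sigmas)
print(weights)
```

The list has one entry per sample, so it can be passed along as the sample weights for the corresponding batch.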
If I should give the sample_weights for each chunk, how do I map the weights if some of the classes are missing from a specific chunk? Let me give an example. In my overall dataset, I have 7 possible classes (labels). Because these classes are highly imbalanced, when I create smaller chunks of data as an output from the fit_generator, some of the classes are missing from the specific chunk. How should I create the sample_weights for these chunks?
This is not an issue. sample_weights is defined on a per-sample basis and is independent from the class. For this reason, the documentation states that (inputs, targets, sample_weights) should be the same length.
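As a sketch of what that looks like in a generator: compute the class weights once on the whole dataset, then look up one weight per sample in each chunk. Classes that are absent from a chunk are simply never looked up. The CLASS_WEIGHT values, the chunk_generator name and the toy arrays below are all made up for illustration, and the code assumes integer labels 0 to 6 (with one-hot targets you would take the argmax first).

```python
import numpy as np

# Hypothetical weights for the 7 classes, computed once on the full dataset.
CLASS_WEIGHT = {0: 0.2, 1: 0.5, 2: 1.0, 3: 2.0, 4: 4.0, 5: 8.0, 6: 16.0}


def chunk_generator(inputs, labels, chunk_size, class_weight):
    """Yield (inputs, targets, sample_weights) chunks in an endless loop."""
    # Lookup table indexed by integer class label.
    table = np.array(
        [class_weight[c] for c in range(len(class_weight))], dtype="float32"
    )
    n = len(labels)
    while True:
        for start in range(0, n, chunk_size):
            x = inputs[start:start + chunk_size]
            y = labels[start:start + chunk_size]
            # One weight per sample; classes missing from this chunk are not touched.
            w = table[y]
            yield x, y, w


# Toy data: the first chunk only contains classes 0 and 6.
inputs = np.zeros((6, 3), dtype="float32")
labels = np.array([0, 0, 6, 0, 1, 0])
gen = chunk_generator(inputs, labels, chunk_size=3, class_weight=CLASS_WEIGHT)
x, y, w = next(gen)
print(y, w)
```

All three arrays in each yielded tuple have the same length along the first axis, which is the only requirement mentioned above.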
The function _weighted_masked_objective in engine/training.py has an example of how sample_weights are applied.
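The general idea can be sketched in a few lines of NumPy: multiply each sample's loss by its weight, then average over the chunk. This is a simplified illustration, not a copy of the Keras function, which also deals with masking and other details not shown here. The function name weighted_mean_loss and the toy numbers are hypothetical.

```python
import numpy as np


def weighted_mean_loss(per_sample_loss, sample_weights):
    """Scale each sample's loss by its weight, then average over the chunk."""
    per_sample_loss = np.asarray(per_sample_loss, dtype="float64")
    sample_weights = np.asarray(sample_weights, dtype="float64")
    if per_sample_loss.shape != sample_weights.shape:
        raise ValueError("need exactly one weight per sample")
    return float(np.mean(per_sample_loss * sample_weights))


# Toy per-sample losses; the third sample belongs to a rare class and is badly predicted.
losses = [0.1, 0.1, 2.0]
unweighted = weighted_mean_loss(losses, [1.0, 1.0, 1.0])
weighted = weighted_mean_loss(losses, [0.2, 0.2, 16.0])
print(unweighted, weighted)
```

With the larger weight on the rare sample, its error dominates the chunk's loss, so the optimizer is pushed harder to fix it.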