Tackling Class Imbalance: scaling contribution to loss and sgd

(An update to this question has been added.)

I am a graduate student at Ghent University, Belgium; my research is about emotion recognition with deep convolutional neural networks. I'm using the Caffe framework to implement the CNNs.

Recently I've run into a problem concerning class imbalance. I'm using 9216 training samples; approx. 5% are labeled positive (1) and the remaining samples are labeled negative (0).

I'm using the SigmoidCrossEntropyLoss layer to calculate the loss. When training, the loss decreases and the accuracy is extremely high after only a few epochs. This is due to the imbalance: the network simply always predicts negative (0). (Precision and recall are both zero, which backs up this claim.)

To solve this problem, I would like to scale each sample's contribution to the loss depending on the prediction-truth combination (punishing false negatives severely). My mentor/coach has also advised me to use a scale factor when backpropagating through stochastic gradient descent (SGD): the factor would be correlated with the imbalance in the batch. A batch containing only negative samples would not update the weights at all.

So far I have only added one custom layer to Caffe, which reports additional metrics such as precision and recall. My experience with the Caffe code is limited, but I have a lot of expertise writing C++ code.


Could anyone help me or point me in the right direction on how to adjust the SigmoidCrossEntropyLoss and Sigmoid layers to accommodate the following changes:

  1. adjust the contribution of a sample to the total loss depending on the prediction-truth combination (true positive, false positive, true negative, false negative); see the sketch after this list.
  2. scale the weight update performed by stochastic gradient descent depending on the imbalance in the batch (negatives vs. positives).
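
To make point 1 concrete, here is a minimal NumPy sketch of a class-weighted sigmoid cross-entropy. The weights w_pos and w_neg are illustrative values (not something the stock Caffe layer exposes), and the sketch only shows the intended math, not a Caffe layer implementation:

    import numpy as np

    def weighted_sigmoid_cross_entropy(logits, labels, w_pos=10.0, w_neg=1.0):
        """Sigmoid cross-entropy where positive and negative samples
        contribute to the loss with different weights; w_pos > w_neg
        punishes false negatives more severely."""
        p = 1.0 / (1.0 + np.exp(-logits))    # predicted probability of class 1
        eps = 1e-12                          # numerical safety for the log
        per_sample = -(w_pos * labels * np.log(p + eps) +
                       w_neg * (1 - labels) * np.log(1 - p + eps))
        return per_sample.mean()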

Thanks in advance!


Update

I have incorporated the InfogainLossLayer as suggested by Shai. I've also added another custom layer that builds the infogain matrix H based on the imbalance in the current batch.

Currently, the matrix is configured as follows:

H(i, j) = 0          if i != j
H(i, j) = 1 - f(i)   if i == j

(with f(i) = the frequency of class i in the batch)

I'm planning on experimenting with different configurations for the matrix in the future.
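
As an illustration only, here is a small NumPy sketch of how such a per-batch H can be computed for the binary case (it mirrors the diagonal configuration above; it is not the actual custom layer code):

    import numpy as np

    def build_infogain_matrix(batch_labels, num_classes=2):
        """Diagonal infogain matrix with H[i, i] = 1 - f(i), where f(i)
        is the frequency of class i in the current batch."""
        counts = np.bincount(batch_labels, minlength=num_classes).astype(np.float64)
        freqs = counts / counts.sum()    # f(i)
        return np.diag(1.0 - freqs)      # off-diagonal entries stay 0

    # Example: a batch with 90 negatives and 10 positives
    labels = np.array([0] * 90 + [1] * 10)
    print(build_infogain_matrix(labels))
    # [[0.1 0. ]
    #  [0.  0.9]]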

I have tested this on a 10:1 imbalance. The results show that the network is now learning useful things (results after 30 epochs):

  • Accuracy is approx. 70% (down from ~97%);
  • Precision is approx. 20% (up from 0%);
  • Recall is approx. 60% (up from 0%).

These numbers were reached at around 20 epochs and didn't change significantly after that.

!! The results stated above are merely a proof of concept; they were obtained by training a simple network on a 10:1 imbalanced dataset. !!

asked May 27 '15 by Maarten Bamelis


1 Answer

Why don't you use the InfogainLoss layer to compensate for the imbalance in your training set?

The Infogain loss is defined using a weight matrix H (in your case 2-by-2). The meaning of its entries is:

    [ cost of predicting 1 when gt is 0,    cost of predicting 0 when gt is 0
      cost of predicting 1 when gt is 1,    cost of predicting 0 when gt is 1 ]

So, you can set the entries of H to reflect the difference between errors in predicting 0 or 1.

You can find how to define matrix H for Caffe in this thread.
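
For concreteness, here is a minimal pycaffe sketch of writing a 2-by-2 H to a binaryproto file that an InfogainLoss layer can load. The weight values and the file name are illustrative assumptions, not recommendations:

    import numpy as np
    import caffe

    # Example 2-by-2 H; the values are illustrative. The entry for the rare
    # positive class is larger, so mistakes on positives weigh more heavily
    # (e.g. roughly the negative:positive ratio of the training set).
    H = np.array([[1.0,  0.0],
                  [0.0, 10.0]], dtype=np.float32)

    blob = caffe.io.array_to_blobproto(H.reshape(1, 1, 2, 2))
    with open('infogain_H.binaryproto', 'wb') as f:
        f.write(blob.SerializeToString())

The resulting file can then be referenced from the InfogainLoss layer's infogain_loss_param { source: ... } in the network prototxt.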

Regarding sample weights, you may find this post interesting: it shows how to modify the SoftmaxWithLoss layer to take into account sample weights.


Recently, a modification to the cross-entropy loss was proposed by Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár: Focal Loss for Dense Object Detection (ICCV 2017).
The idea behind focal loss is to assign a different weight to each example based on the relative difficulty of predicting that example (rather than based on class size etc.). From the brief time I got to experiment with this loss, it feels superior to "InfogainLoss" with class-size weights.
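
For reference, here is the per-example focal loss from that paper for the binary case, as a small NumPy sketch (gamma and alpha are the paper's focusing and balancing parameters; the defaults below are the values reported in the paper):

    import numpy as np

    def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25):
        """Focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t)."""
        eps = 1e-12
        p_t = np.where(labels == 1, probs, 1.0 - probs)        # prob. of the true class
        alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)    # class-balancing weight
        return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps))

Easy, well-classified examples have p_t close to 1, so the (1 - p_t)**gamma factor shrinks their contribution and the loss concentrates on the hard examples.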

answered by Shai