I am referring to the Keras documentation to build a network which takes multiple inputs in the form of embeddings along with some other important features. But I didn't understand the exact effect of the auxiliary loss when we have already defined a main loss.
The documentation says: "Here we insert the auxiliary loss, allowing the LSTM and Embedding layer to be trained smoothly even though the main loss will be much higher in the model."
As mentioned in the documentation, I am assuming it helps the Embedding (and any other layers defined before it) to train smoothly. My question is: how do I decide the weight for the auxiliary loss?
The documentation then says: "We compile the model and assign a weight of 0.2 to the auxiliary loss. To specify different loss_weights or loss for each different output, you can use a list or a dictionary."
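For reference, this is roughly the setup I have in mind. The layer sizes, names, and the extra-features input below are placeholders of my own, not the exact example from the docs:

    from tensorflow import keras
    from tensorflow.keras import layers

    # Main input: a sequence of token ids fed through an Embedding and an LSTM.
    main_input = keras.Input(shape=(100,), dtype="int32", name="main_input")
    x = layers.Embedding(input_dim=10000, output_dim=512)(main_input)
    lstm_out = layers.LSTM(32)(x)

    # Auxiliary output branches off right after the LSTM, so its gradient
    # reaches the Embedding/LSTM without passing through the deeper layers.
    aux_output = layers.Dense(1, activation="sigmoid", name="aux_output")(lstm_out)

    # The other important features are merged in for the main branch.
    aux_input = keras.Input(shape=(5,), name="aux_input")
    x = layers.concatenate([lstm_out, aux_input])
    x = layers.Dense(64, activation="relu")(x)
    main_output = layers.Dense(1, activation="sigmoid", name="main_output")(x)

    model = keras.Model(inputs=[main_input, aux_input],
                        outputs=[main_output, aux_output])

    # Main loss gets full weight; the auxiliary loss is down-weighted to 0.2,
    # matching the documentation snippet quoted above.
    model.compile(
        optimizer="rmsprop",
        loss={"main_output": "binary_crossentropy",
              "aux_output": "binary_crossentropy"},
        loss_weights={"main_output": 1.0, "aux_output": 0.2},
    )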
I would really appreciate it if someone could explain how to decide the loss weights, and how a higher or lower auxiliary loss weight affects model training and prediction.
This is a really interesting question. The idea of auxiliary classifiers is not as uncommon as one may think; it is used, for example, in the Inception architecture. In this answer I will try to give you a few intuitions on why this tweak might actually help in training:
It helps the gradient to pass down to lower layers: one may think that a loss defined for an auxiliary classifier is conceptually similar to the main loss, because both of them measure how good our model is. Because of that, we may assume that the gradient w.r.t. the lower layers should be similar for both of these losses. The vanishing gradient phenomenon is still an issue, even with techniques like Batch Normalization, so every additional source of gradient might improve your training performance (see the sketch after this list).
It makes low-level features more accurate: while we are training our network, the information about how good the model's low-level features are, and how to change them, must travel back through all the other layers of the network. Not only can this make the gradient vanish, but because the operations performed in a neural network can be quite complex, it can also make the information about your lower-level features irrelevant. This matters especially in the early stage of training, when most of your features are still fairly random (due to random initialization) and the direction in which your weights are pushed might be semantically bizarre. Auxiliary outputs can overcome this problem because, in this setup, your lower-level features are forced to be meaningful from the earliest part of training.
This might be considered an intelligent form of regularization: you are putting a meaningful constraint on your model which might prevent overfitting, especially on small datasets.
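To make the mechanics concrete: with loss_weights, Keras simply optimizes a single weighted sum of the per-output losses, so the auxiliary weight controls how strongly the short gradient path (ending right after the LSTM) contributes relative to the long path through the whole network. Here is a minimal sketch of that combination, assuming the two-output model from the question and binary targets:

    import tensorflow as tf

    bce = tf.keras.losses.BinaryCrossentropy()

    def combined_loss(model, inputs, y_true, aux_weight=0.2):
        # `model` is assumed to be the two-output model from the question:
        # it returns (main_pred, aux_pred) for a batch of inputs.
        main_pred, aux_pred = model(inputs, training=True)
        main_loss = bce(y_true, main_pred)   # long gradient path through all layers
        aux_loss = bce(y_true, aux_pred)     # short path, stops right after the LSTM
        # This weighted sum is what the optimizer actually minimizes; the gradient
        # reaching the Embedding weights is therefore
        # d(main_loss)/d(Embedding) + aux_weight * d(aux_loss)/d(Embedding).
        return main_loss + aux_weight * aux_loss

So a larger auxiliary weight pushes the Embedding/LSTM harder toward features that are already predictive on their own, while a smaller weight lets the main branch dominate.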
From what I wrote above, one may infer some hints about how to set the auxiliary loss weight: