I am referring to the Keras documentation to build a network which takes multiple inputs in the form of embeddings along with some other important features. But I didn't understand the exact effect of the auxiliary loss when we have already defined a main loss.
The documentation says: "Here we insert the auxiliary loss, allowing the LSTM and Embedding layer to be trained smoothly even though the main loss will be much higher in the model."
As mentioned in the documentation, I am assuming it helps the Embedding (and any other layers defined before it) to train smoothly. My question is: how do I decide the weight for the auxiliary loss?
The documentation then says: "We compile the model and assign a weight of 0.2 to the auxiliary loss. To specify different loss_weights or loss for each different output, you can use a list or a dictionary."
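For reference, this is roughly the setup I have in mind. The layer sizes, names, and the extra-features input below are placeholders of my own, not the exact example from the docs:

    from tensorflow import keras
    from tensorflow.keras import layers

    # Main input: a sequence of token ids fed through an Embedding and an LSTM.
    main_input = keras.Input(shape=(100,), dtype="int32", name="main_input")
    x = layers.Embedding(input_dim=10000, output_dim=512)(main_input)
    lstm_out = layers.LSTM(32)(x)

    # Auxiliary output branches off right after the LSTM, so its gradient
    # reaches the Embedding/LSTM without passing through the deeper layers.
    aux_output = layers.Dense(1, activation="sigmoid", name="aux_output")(lstm_out)

    # The other important features are merged in for the main branch.
    aux_input = keras.Input(shape=(5,), name="aux_input")
    x = layers.concatenate([lstm_out, aux_input])
    x = layers.Dense(64, activation="relu")(x)
    main_output = layers.Dense(1, activation="sigmoid", name="main_output")(x)

    model = keras.Model(inputs=[main_input, aux_input],
                        outputs=[main_output, aux_output])

    # Main loss gets full weight; the auxiliary loss is down-weighted to 0.2,
    # matching the documentation snippet quoted above.
    model.compile(
        optimizer="rmsprop",
        loss={"main_output": "binary_crossentropy",
              "aux_output": "binary_crossentropy"},
        loss_weights={"main_output": 1.0, "aux_output": 0.2},
    )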
I would really appreciate it if someone could explain how to decide the loss weights, and how a higher or lower auxiliary loss weight affects model training and prediction.
This is a really interesting question. The idea of auxiliary classifiers is not as uncommon as one may think; it is used, for example, in the Inception architecture. In this answer I will try to give you a few intuitions on why this tweak might actually help in training:
It helps the gradient to pass down to lower layers: one may think that a loss defined for an auxiliary classifier is conceptually similar to the main loss, because both of them measure how good our model is. Because of that, we may assume that the gradient w.r.t. the lower layers should be similar for both of these losses. The vanishing gradient phenomenon is still an issue, even with techniques like Batch Normalization, so every additional source of gradient might improve your training performance (see the sketch after this list).
It makes low-level features more accurate: while we are training our network, the information about how good the model's low-level features are, and how to change them, must travel back through all the other layers of the network. Not only can this make the gradient vanish, but because the operations performed in a neural network can be quite complex, it can also make the information about your lower-level features irrelevant. This matters especially in the early stage of training, when most of your features are still fairly random (due to random initialization) and the direction in which your weights are pushed might be semantically bizarre. Auxiliary outputs can overcome this problem because, in this setup, your lower-level features are forced to be meaningful from the earliest part of training.
This might be considered an intelligent form of regularization: you are putting a meaningful constraint on your model which might prevent overfitting, especially on small datasets.
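To make the mechanics concrete: with loss_weights, Keras simply optimizes a single weighted sum of the per-output losses, so the auxiliary weight controls how strongly the short gradient path (ending right after the LSTM) contributes relative to the long path through the whole network. Here is a minimal sketch of that combination, assuming the two-output model from the question and binary targets:

    import tensorflow as tf

    bce = tf.keras.losses.BinaryCrossentropy()

    def combined_loss(model, inputs, y_true, aux_weight=0.2):
        # `model` is assumed to be the two-output model from the question:
        # it returns (main_pred, aux_pred) for a batch of inputs.
        main_pred, aux_pred = model(inputs, training=True)
        main_loss = bce(y_true, main_pred)   # long gradient path through all layers
        aux_loss = bce(y_true, aux_pred)     # short path, stops right after the LSTM
        # This weighted sum is what the optimizer actually minimizes; the gradient
        # reaching the Embedding weights is therefore
        # d(main_loss)/d(Embedding) + aux_weight * d(aux_loss)/d(Embedding).
        return main_loss + aux_weight * aux_loss

So a larger auxiliary weight pushes the Embedding/LSTM harder toward features that are already predictive on their own, while a smaller weight lets the main branch dominate.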
From what I wrote above, one may infer some hints about how to set the auxiliary loss weight: