 

Tensorflow: Multiple loss functions vs Multiple training ops

I am creating a Tensorflow model which predicts multiple outputs (with different activations). I think there are two ways to do this:

Method 1: Create multiple loss functions (one for each output), merge them (using tf.reduce_mean or tf.reduce_sum) and pass the result to the training op like so:

final_loss = tf.reduce_mean(loss1 + loss2)
train_op = tf.train.AdamOptimizer().minimize(final_loss)

Method 2: Create multiple training operations and then group them like so:

train_op1 = tf.train.AdamOptimizer().minimize(loss1)
train_op2 = tf.train.AdamOptimizer().minimize(loss2)
final_train_op = tf.group(train_op1, train_op2)

My question is whether one method is advantageous over the other. Is there a third method I don't know about?

Thanks

Asked by Ankit Bindal on Apr 21 '18

People also ask

What is loss in TensorFlow training?

We use a loss function to measure how far the predicted values deviate from the actual values in the training data. We then adjust the model weights to minimize that loss, which is what training is all about.
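For example, a common choice is mean squared error; a minimal TF 1.x sketch (the placeholder shapes below are made up for illustration):

import tensorflow as tf

# Hypothetical placeholders for targets and model predictions.
y_true = tf.placeholder(tf.float32, shape=[None, 1])
y_pred = tf.placeholder(tf.float32, shape=[None, 1])

# Mean squared error: the average squared deviation of predictions from targets.
# Training adjusts the model weights to drive this value down.
mse_loss = tf.reduce_mean(tf.square(y_pred - y_true))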

How do you do two loss functions?

You can try f = (1/n) * ln(loss1) + loss2, for a suitable value of n that scales the first term down to the range of the loss2 values. However, this will only work if loss1 is positive.
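A minimal sketch of that combination in TF 1.x (the loss values and n below are made-up stand-ins; loss1 must be positive):

import tensorflow as tf

loss1 = tf.constant(250.0)   # stand-in for the first loss; must be positive
loss2 = tf.constant(0.8)     # stand-in for the second loss
n = 10.0                     # chosen so (1/n)*ln(loss1) lands near loss2's range

# f = (1/n) * ln(loss1) + loss2   (tf.log is the natural logarithm in TF 1.x)
combined_loss = (1.0 / n) * tf.log(loss1) + loss2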

How do you use multiple loss in keras?

How does Keras handle multiple losses? From the Keras documentation: "…the loss value that will be minimized by the model will then be the weighted sum of all individual losses, weighted by the loss_weights coefficients." Therefore, the final loss is a weighted sum of the individual losses passed to the loss parameter.
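A hedged Keras sketch of that weighted sum for a two-output model (the layer sizes, output names, and weights are arbitrary examples):

from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical two-output model: one regression head, one sigmoid head.
inputs = keras.Input(shape=(16,))
hidden = layers.Dense(32, activation='relu')(inputs)
out_reg = layers.Dense(1, name='reg')(hidden)
out_clf = layers.Dense(1, activation='sigmoid', name='clf')(hidden)
model = keras.Model(inputs, [out_reg, out_clf])

# Final loss minimized = 1.0 * mse(reg) + 0.5 * binary_crossentropy(clf)
model.compile(optimizer='adam',
              loss={'reg': 'mse', 'clf': 'binary_crossentropy'},
              loss_weights={'reg': 1.0, 'clf': 0.5})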


2 Answers

I want to make a subtle point that I don't think was made in previous answers.

If you were using something like GradientDescentOptimizer, these would be very similar operations. That's because taking gradients is a linear operation, and the gradient of a sum is the same as the sum of the gradients.

But ADAM does something special: regardless of the scale of your loss, it scales the gradients so that the updates are always on the order of your learning rate. If you multiplied your loss by 1000, it wouldn't affect ADAM, because the change would be normalized away.

So, if your two losses are roughly the same magnitude, then it shouldn't make a difference. If one is much larger than the other, then keep in mind that summing before the minimization will essentially ignore the small one, while making two ops will spend equal effort minimizing both.
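A quick way to see the scale point with toy variables (one Adam step each; the setup is purely illustrative and the numbers are approximate):

import tensorflow as tf

w1 = tf.Variable(0.0)
w2 = tf.Variable(0.0)
step1 = tf.train.AdamOptimizer(0.01).minimize(tf.square(w1 - 1.0))
step2 = tf.train.AdamOptimizer(0.01).minimize(1000.0 * tf.square(w2 - 1.0))  # same loss, 1000x bigger

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run([step1, step2])
    print(sess.run([w1, w2]))  # both move by roughly 0.01: the 1000x scale is normalized away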

I personally like dividing them up, which gives you more control over how much to focus on one loss or the other. For example, in multi-task learning where one task is more important to get right than the other, two ops with different learning rates roughly accomplish this.
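For instance, a rough sketch of that idea (the variable, losses, and learning rates below are placeholders, not a recommendation):

import tensorflow as tf

# Hypothetical shared parameters and two task losses built on them.
w = tf.Variable([1.0, 1.0])
loss_main = tf.reduce_sum(tf.square(w - 3.0))   # task you care more about
loss_aux = tf.reduce_sum(tf.square(w + 1.0))    # secondary task

# Separate ops let each task get its own learning rate.
train_main = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss_main)
train_aux = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss_aux)
train_op = tf.group(train_main, train_aux)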

Answered by Sam Bobel on Nov 01 '22


The difference between the two methods is demonstrated clearly in this post on multi-task learning in tensorflow.

In short:

Method 1: This is called joint training. Since it directly adds the losses together, all the gradients and updates are computed with respect to both losses at the same time. Generally this is used when training multiple outputs from the same set of input features.

Method 2: This creates two separate optimizers and is called alternate training. It is used when you use a different subset of the input features for each output. Therefore, when feeding in the feature subset for train_op1, the sub-graph for train_op2 is untouched. The optimizers can be called in alternating order, each with its own input features.

If you run both optimizers concurrently with the same input data, then the difference from Method 1 is probably very minor.
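A rough sketch of both setups, assuming two outputs with their own feature subsets (all placeholder names and shapes below are invented for illustration):

import tensorflow as tf

x1 = tf.placeholder(tf.float32, [None, 4])   # features used by output 1
x2 = tf.placeholder(tf.float32, [None, 4])   # features used by output 2
y1 = tf.placeholder(tf.float32, [None, 1])
y2 = tf.placeholder(tf.float32, [None, 1])

pred1 = tf.layers.dense(x1, 1)
pred2 = tf.layers.dense(x2, 1)
loss1 = tf.losses.mean_squared_error(y1, pred1)
loss2 = tf.losses.mean_squared_error(y2, pred2)

# Joint training (Method 1): one op updates w.r.t. both losses on every step.
joint_op = tf.train.AdamOptimizer().minimize(loss1 + loss2)

# Alternate training (Method 2): separate ops, each run only with its own features, e.g.
#   sess.run(op1, {x1: batch_x1, y1: batch_y1})
#   sess.run(op2, {x2: batch_x2, y2: batch_y2})
op1 = tf.train.AdamOptimizer().minimize(loss1)
op2 = tf.train.AdamOptimizer().minimize(loss2)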

Answered by khuang834 on Nov 01 '22