Let's say I have network with following params:
Now, I know that the loss is calculated in the following manner: binary cross entropy is applied to each pixel in the image with regards to each class. So essentially, each pixel will have 5 loss values
What happens after this step?
When I train my network, it prints only a single loss value for an epoch. There are many levels of loss accumulation that need to happen to produce a single value and how it happens is not clear at all in the docs/code.
axis=-1
. So is this an average of all the pixels per class or average of all the classes or is it both?? To state it in a different way: how are the losses for different classes combined to produce a single loss value for an image?
This is not explained in the docs at all and would be very helpful for people doing multi-class predictions on keras, regardless of the type of network. Here is the link to the start of keras code where one first passes in the loss function.
The closest thing I could find to an explanation is
loss: String (name of objective function) or objective function. See losses. If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses. The loss value that will be minimized by the model will then be the sum of all individual losses
from keras. So does this mean that the losses for each class in the image is simply summed?
Example code here for someone to try it out. Here's a basic implementation borrowed from Kaggle and modified for multi-label prediction:
# Build U-Net model
num_classes = 5
IMG_DIM = 256
IMG_CHAN = 3
weights = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1000} #chose an extreme value just to check for any reaction
inputs = Input((IMG_DIM, IMG_DIM, IMG_CHAN))
s = Lambda(lambda x: x / 255) (inputs)
c1 = Conv2D(8, (3, 3), activation='relu', padding='same') (s)
c1 = Conv2D(8, (3, 3), activation='relu', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)
c2 = Conv2D(16, (3, 3), activation='relu', padding='same') (p1)
c2 = Conv2D(16, (3, 3), activation='relu', padding='same') (c2)
p2 = MaxPooling2D((2, 2)) (c2)
c3 = Conv2D(32, (3, 3), activation='relu', padding='same') (p2)
c3 = Conv2D(32, (3, 3), activation='relu', padding='same') (c3)
p3 = MaxPooling2D((2, 2)) (c3)
c4 = Conv2D(64, (3, 3), activation='relu', padding='same') (p3)
c4 = Conv2D(64, (3, 3), activation='relu', padding='same') (c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)
c5 = Conv2D(128, (3, 3), activation='relu', padding='same') (p4)
c5 = Conv2D(128, (3, 3), activation='relu', padding='same') (c5)
u6 = Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same') (c5)
u6 = concatenate([u6, c4])
c6 = Conv2D(64, (3, 3), activation='relu', padding='same') (u6)
c6 = Conv2D(64, (3, 3), activation='relu', padding='same') (c6)
u7 = Conv2DTranspose(32, (2, 2), strides=(2, 2), padding='same') (c6)
u7 = concatenate([u7, c3])
c7 = Conv2D(32, (3, 3), activation='relu', padding='same') (u7)
c7 = Conv2D(32, (3, 3), activation='relu', padding='same') (c7)
u8 = Conv2DTranspose(16, (2, 2), strides=(2, 2), padding='same') (c7)
u8 = concatenate([u8, c2])
c8 = Conv2D(16, (3, 3), activation='relu', padding='same') (u8)
c8 = Conv2D(16, (3, 3), activation='relu', padding='same') (c8)
u9 = Conv2DTranspose(8, (2, 2), strides=(2, 2), padding='same') (c8)
u9 = concatenate([u9, c1], axis=3)
c9 = Conv2D(8, (3, 3), activation='relu', padding='same') (u9)
c9 = Conv2D(8, (3, 3), activation='relu', padding='same') (c9)
outputs = Conv2D(num_classes, (1, 1), activation='sigmoid') (c9)
model = Model(inputs=[inputs], outputs=[outputs])
model.compile(optimizer='adam', loss=weighted_loss(weights), metrics=[mean_iou])
def weighted_loss(weightsList):
def lossFunc(true, pred):
axis = -1 #if channels last
#axis= 1 #if channels first
classSelectors = K.argmax(true, axis=axis)
classSelectors = [K.equal(tf.cast(i, tf.int64), tf.cast(classSelectors, tf.int64)) for i in range(len(weightsList))]
classSelectors = [K.cast(x, K.floatx()) for x in classSelectors]
weights = [sel * w for sel,w in zip(classSelectors, weightsList)]
weightMultiplier = weights[0]
for i in range(1, len(weights)):
weightMultiplier = weightMultiplier + weights[i]
loss = BCE_loss(true, pred) - (1+dice_coef(true, pred))
loss = loss * weightMultiplier
return loss
return lossFunc
model.summary()
The actual BCE-DICE loss function can be found here.
Motivation for the question: Based on the above code, the total validation loss of the network after 20 epochs is ~1%; however, the mean intersection over union scores for the first 4 classes are above 95% each, but for the last class its 23%. Clearly indicating that the 5th class isn't doing well at all. However, this loss in accuracy isn't being reflected at all in the loss. Hence, that means the individual losses for the sample are being combined in a way that completely negates the huge loss we see for the 5th class. And, so when the per sample losses are being combined over batch, it's still really low. I'm not sure how to reconcile this information.
Creating custom loss functions in Keras A custom loss function can be created by defining a function that takes the true values and predicted values as required parameters. The function should return an array of losses. The function can then be passed at the compile stage.
BinaryCrossentropy class Computes the cross-entropy loss between true labels and predicted labels. Use this cross-entropy loss for binary (0 or 1) classification applications. The loss function requires the following inputs: y_true (true label): This is either 0 or 1.
The loss function is used to optimize your model. This is the function that will get minimized by the optimizer. A metric is used to judge the performance of your model. This is only for you to look at and has nothing to do with the optimization process.
That is, loss is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
Although I have already mentioned part of this answer in a related answer, but let's inspect the source code step-by-step with more details to find the answer concretely.
First, Let's feedforward(!): there is a call to weighted_loss
function which takes y_true
, y_pred
, sample_weight
and mask
as inputs:
weighted_loss = weighted_losses[i]
# ...
output_loss = weighted_loss(y_true, y_pred, sample_weight, mask)
weighted_loss
is actually an element of a list which contains all the (augmented) loss functions passed to fit
method:
weighted_losses = [
weighted_masked_objective(fn) for fn in loss_functions]
The "augmented" word I mentioned is important here. That's because, as you can see above, the actual loss function is wrapped by another function called weighted_masked_objective
which has been defined as follows:
def weighted_masked_objective(fn):
"""Adds support for masking and sample-weighting to an objective function.
It transforms an objective function `fn(y_true, y_pred)`
into a sample-weighted, cost-masked objective function
`fn(y_true, y_pred, weights, mask)`.
# Arguments
fn: The objective function to wrap,
with signature `fn(y_true, y_pred)`.
# Returns
A function with signature `fn(y_true, y_pred, weights, mask)`.
"""
if fn is None:
return None
def weighted(y_true, y_pred, weights, mask=None):
"""Wrapper function.
# Arguments
y_true: `y_true` argument of `fn`.
y_pred: `y_pred` argument of `fn`.
weights: Weights tensor.
mask: Mask tensor.
# Returns
Scalar tensor.
"""
# score_array has ndim >= 2
score_array = fn(y_true, y_pred)
if mask is not None:
# Cast the mask to floatX to avoid float64 upcasting in Theano
mask = K.cast(mask, K.floatx())
# mask should have the same shape as score_array
score_array *= mask
# the loss per batch should be proportional
# to the number of unmasked samples.
score_array /= K.mean(mask)
# apply sample weighting
if weights is not None:
# reduce score_array to same ndim as weight array
ndim = K.ndim(score_array)
weight_ndim = K.ndim(weights)
score_array = K.mean(score_array,
axis=list(range(weight_ndim, ndim)))
score_array *= weights
score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
return K.mean(score_array)
return weighted
So, there is a nested function, weighted
, that actually calls the real loss function fn
in the line score_array = fn(y_true, y_pred)
. Now, to be concrete, in case of the example the OP provided, the fn
(i.e. loss function) is binary_crossentropy
. Therefore we need to take a look at the definition of binary_crossentropy()
in Keras:
def binary_crossentropy(y_true, y_pred):
return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)
which in turn, calls the backend function K.binary_crossentropy()
. In case of using Tensorflow as the backend, the definition of K.binary_crossentropy()
is as follows:
def binary_crossentropy(target, output, from_logits=False):
"""Binary crossentropy between an output tensor and a target tensor.
# Arguments
target: A tensor with the same shape as `output`.
output: A tensor.
from_logits: Whether `output` is expected to be a logits tensor.
By default, we consider that `output`
encodes a probability distribution.
# Returns
A tensor.
"""
# Note: tf.nn.sigmoid_cross_entropy_with_logits
# expects logits, Keras expects probabilities.
if not from_logits:
# transform back to logits
_epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
output = tf.log(output / (1 - output))
return tf.nn.sigmoid_cross_entropy_with_logits(labels=target,
logits=output)
The tf.nn.sigmoid_cross_entropy_with_logits
returns:
A Tensor of the same shape as
logits
with the componentwise logistic losses.
Now, let's backpropagate(!): considering the above note, the output shape of K.binray_crossentropy
would be the same as y_pred
(or y_true
). As the OP mentioned, y_true
has a shape of (batch_size, img_dim, img_dim, num_classes)
. Therefore, the K.mean(..., axis=-1)
is applied over a tensor of shape (batch_size, img_dim, img_dim, num_classes)
which results in an output tensor of shape (batch_size, img_dim, img_dim)
. So the loss values of all classes are averaged for each pixel in the image. Hence, the shape of score_array
in weighted
function mentioned above would be (batch_size, img_dim, img_dim)
. There is one more step: the return statement in weighted
function takes the mean again i.e. return K.mean(score_array)
. So how does it compute the mean? If you take a look at the definition of mean
backend function you would find out that the axis
argument is None
by default:
def mean(x, axis=None, keepdims=False):
"""Mean of a tensor, alongside the specified axis.
# Arguments
x: A tensor or variable.
axis: A list of integer. Axes to compute the mean.
keepdims: A boolean, whether to keep the dimensions or not.
If `keepdims` is `False`, the rank of the tensor is reduced
by 1 for each entry in `axis`. If `keepdims` is `True`,
the reduced dimensions are retained with length 1.
# Returns
A tensor with the mean of elements of `x`.
"""
if x.dtype.base_dtype == tf.bool:
x = tf.cast(x, floatx())
return tf.reduce_mean(x, axis, keepdims)
And it calls the tf.reduce_mean()
which given an axis=None
argument, takes the mean over all the axes of input tensor and return one single value. Therefore, the mean of the whole tensor of shape (batch_size, img_dim, img_dim)
is computed, which translates to taking the average over all the labels in the batch and over all their pixels, and is returned as one single scalar value which represents the loss value. Then, this loss value is reported back by Keras and is used for optimization.
Bonus: what if our model has multiple output layers and therefore multiple loss functions are used?
Remember the first piece of code I mentioned in this answer:
weighted_loss = weighted_losses[i]
# ...
output_loss = weighted_loss(y_true, y_pred, sample_weight, mask)
As you can see there is an i
variable which is used for indexing the array. You may have guessed correctly: it is actually part of a loop which computes the loss value for each output layer using its designated loss function and then takes the (weighted) sum of all these loss values to compute the total loss:
# Compute total loss.
total_loss = None
with K.name_scope('loss'):
for i in range(len(self.outputs)):
if i in skip_target_indices:
continue
y_true = self.targets[i]
y_pred = self.outputs[i]
weighted_loss = weighted_losses[i]
sample_weight = sample_weights[i]
mask = masks[i]
loss_weight = loss_weights_list[i]
with K.name_scope(self.output_names[i] + '_loss'):
output_loss = weighted_loss(y_true, y_pred,
sample_weight, mask)
if len(self.outputs) > 1:
self.metrics_tensors.append(output_loss)
self.metrics_names.append(self.output_names[i] + '_loss')
if total_loss is None:
total_loss = loss_weight * output_loss
else:
total_loss += loss_weight * output_loss
if total_loss is None:
if not self.losses:
raise ValueError('The model cannot be compiled '
'because it has no loss to optimize.')
else:
total_loss = 0.
# Add regularization penalties
# and other layer-specific losses.
for loss_tensor in self.losses:
total_loss += loss_tensor
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With