
Non-smooth and non-differentiable customized loss function tensorflow

Tags:

tensorflow

  1. In TensorFlow, can you use a non-smooth function as a loss function, such as a piece-wise function (or one with if-else)? If you can't, why can you use ReLU?

  2. In this link SLIM, it says:

"For example, we might want to minimize log loss, but our metrics of interest might be F1 score, or Intersection Over Union score (which are not differentiable, and therefore cannot be used as losses)."

Does it mean "not differentiable" at all, such as for set problems? Because ReLU is not differentiable at point 0.

  3. If you use such a customized loss function, do you need to implement the gradient yourself? Or can TensorFlow do it for you automatically? I checked some customized loss functions, and they didn't implement the gradient for their loss function.
asked Nov 22 '16 by user2863356

2 Answers

The problem is not with the loss being piece-wise or non-smooth. The problem is that we need a loss function that can send back a non-zero gradient to the network parameters (dloss/dparameter) when there is an error between the output and the expected output. This applies to almost any function used inside the model (e.g. loss functions, activation functions, attention functions).

For example, perceptrons use the unit step H(x) as an activation function (H(x) = 1 if x > 0 else 0). Since the derivative of H(x) is zero everywhere it is defined (and undefined at x=0), no gradient coming from the loss will pass through it back to the weights (chain rule), so no weights before that function in the network can be updated using gradient descent. Because of this, gradient descent can't be used for perceptrons, but it can be used for conventional neurons that use the sigmoid activation function (since its gradient is non-zero for all x).
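A minimal sketch of this difference (TensorFlow 2.x; the variable name and the 0.7 input are made up for illustration, not from the original answer):

```python
import tensorflow as tf

# Made-up scalar input, just to probe the derivatives.
x = tf.Variable(0.7)

with tf.GradientTape(persistent=True) as tape:
    step = tf.cast(x > 0.0, tf.float32)  # perceptron-style unit step H(x)
    sig = tf.sigmoid(x)                  # smooth, non-zero derivative everywhere

print(tape.gradient(step, x))  # None: the step blocks the gradient entirely
print(tape.gradient(sig, x))   # ~0.22: a usable, non-zero gradient
```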

For ReLU, the derivative is 1 for x > 0 and 0 otherwise. While the derivative is undefined at x=0, we can still back-propagate the loss gradient through it when x > 0. That's why it can be used.

That is why we need a loss function that has a non-zero gradient. Functions like accuracy and F1 have gradients that are zero almost everywhere (or undefined at some values of x), so they can't be used, while functions like cross-entropy, L2 and L1 have non-zero gradients, so they can be used. (Note that L1, the "absolute difference", is piece-wise and not smooth at x=0 but can still be used.)
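As a rough illustration (TensorFlow 2.x; the tensors below are made up for this sketch), the piece-wise L1 loss still yields usable gradients, while a rounding-based accuracy metric yields none:

```python
import tensorflow as tf

y_true = tf.constant([1.0, 0.0, 1.0])
y_pred = tf.Variable([0.8, 0.3, 0.4])

with tf.GradientTape(persistent=True) as tape:
    # L1 loss: piece-wise and non-smooth at 0, but differentiable almost everywhere.
    l1 = tf.reduce_mean(tf.abs(y_true - y_pred))
    # Accuracy-style metric: built on rounding/comparison, so no gradient exists.
    acc = tf.reduce_mean(tf.cast(tf.round(y_pred) == y_true, tf.float32))

print(tape.gradient(l1, y_pred))   # signed, non-zero gradients -> trainable
print(tape.gradient(acc, y_pred))  # None -> nothing to back-propagate
```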

In case you must use a function that doesn't meet the above criteria, try reinforcement learning methods instead (e.g. policy gradients).

answered Jan 02 '23 by Yahia Zakaria


As far as question #3 of the OP goes, you actually don't have to implement the gradient computations yourself. TensorFlow will do that for you through automatic differentiation, which is one of the things I love about it!
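For instance, a custom piece-wise loss written with TensorFlow ops is differentiated automatically; no hand-written gradient is needed. This is only a sketch (TensorFlow 2.x), and `piecewise_loss` and the tiny linear model are made up for illustration:

```python
import tensorflow as tf

def piecewise_loss(y_true, y_pred):
    # Piece-wise: absolute error when small, squared error when large.
    err = tf.abs(y_true - y_pred)
    return tf.reduce_mean(tf.where(err < 1.0, err, tf.square(err)))

w = tf.Variable([[0.5], [-0.3]])   # toy weights
x = tf.constant([[1.0, 2.0]])      # toy input
y_true = tf.constant([[1.0]])

with tf.GradientTape() as tape:
    y_pred = tf.matmul(x, w)
    loss = piecewise_loss(y_true, y_pred)

# TensorFlow builds the gradient of the custom loss automatically.
print(tape.gradient(loss, w))
```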

answered Jan 02 '23 by braindead