In TensorFlow, can you use a non-smooth function as a loss function, such as a piece-wise one (or one with if-else)? If you can't, why can you use ReLU?
In this link on SLIM, it says:
"For example, we might want to minimize log loss, but our metrics of interest might be F1 score, or Intersection Over Union score (which are not differentiable, and therefore cannot be used as losses)."
Does it mean "not differentiable" anywhere at all, as in set problems? Because ReLU is not differentiable at the point 0.
The problem is not with the loss being piece-wise or non-smooth. The problem is that we need a loss function that can send a non-zero gradient back to the network parameters (dloss/dparameter) when there is an error between the output and the expected output. This applies to almost any function used inside the model (e.g. loss functions, activation functions, attention functions).
For example, perceptrons use the unit step H(x) as an activation function (H(x) = 1 if x > 0 else 0). Since the derivative of H(x) is always zero (undefined at x = 0), no gradient coming from the loss will pass through it back to the weights (chain rule), so no weights before that function in the network can be updated using gradient descent. Based on that, gradient descent can't be used for perceptrons, but it can be used for conventional neurons that use the sigmoid activation function (since the gradient is non-zero for all x).
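Here is a minimal sketch (assuming TensorFlow 2.x with eager execution) that demonstrates this: the unit step, written with tf.where, passes no gradient back to the variable, while sigmoid does.

```python
import tensorflow as tf

x = tf.Variable(2.0)

with tf.GradientTape(persistent=True) as tape:
    step = tf.where(x > 0.0, 1.0, 0.0)  # unit step H(x): constant branches, no useful gradient
    sig = tf.sigmoid(x)                 # smooth alternative

print(tape.gradient(step, x))  # None: the step output does not depend differentiably on x
print(tape.gradient(sig, x))   # ~0.105: a non-zero gradient that can update weights
```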
For ReLU, the derivative is 1 for x > 0 and 0 otherwise. While the derivative is undefined at x = 0, we can still back-propagate the loss gradient through it when x > 0. That's why it can be used.
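You can check this directly (again assuming TensorFlow 2.x); at the undefined point x = 0, TensorFlow simply returns a gradient of 0 by convention, so back-propagation never breaks in practice:

```python
import tensorflow as tf

x = tf.Variable([-1.0, 0.0, 3.0])

with tf.GradientTape() as tape:
    y = tf.nn.relu(x)

# Gradient is 0 for x < 0, 0 at x = 0 (by convention), 1 for x > 0
print(tape.gradient(y, x))  # tf.Tensor([0. 0. 1.], shape=(3,), dtype=float32)
```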
That is why we need a loss function that has a non-zero gradient. Functions like accuracy and F1 have zero gradients everywhere (or are undefined at some values of x), so they can't be used, while functions like cross-entropy, L2, and L1 have non-zero gradients, so they can be used. (Note that L1, the absolute difference, is piece-wise and not smooth at x = 0 but can still be used.)
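The L1 case is worth seeing concretely. A minimal sketch (TensorFlow 2.x assumed): even though |w - y| has a kink at 0, the gradient is non-zero almost everywhere, which is all gradient descent needs.

```python
import tensorflow as tf

y_true = tf.constant([1.0, 2.0, 3.0])
w = tf.Variable([0.5, 2.5, 3.0])

with tf.GradientTape() as tape:
    l1_loss = tf.reduce_mean(tf.abs(w - y_true))

# sign(w - y) / 3 elementwise; TensorFlow uses 0 at the kink itself
print(tape.gradient(l1_loss, w))  # tf.Tensor([-0.333  0.333  0.], ...)
```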
In case you must use a function that doesn't meet the above criteria, try reinforcement learning methods instead (e.g. policy gradients).
As far as Question #3 of the OP goes, you actually don't have to implement the gradient computations yourself. TensorFlow will do that for you, which is one of the things I love about it!
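For instance, here is a sketch (TensorFlow 2.x assumed; huber_like_loss is a hypothetical name for a Huber-style loss) of a custom piece-wise, if-else style loss written with tf.where. TensorFlow derives its gradient automatically, and a plain training loop converges without any hand-written gradient:

```python
import tensorflow as tf

def huber_like_loss(y_true, y_pred, delta=1.0):
    # Piece-wise: quadratic for small errors, linear for large ones
    err = y_pred - y_true
    return tf.reduce_mean(
        tf.where(tf.abs(err) <= delta,
                 0.5 * tf.square(err),
                 delta * (tf.abs(err) - 0.5 * delta)))

w = tf.Variable(0.0)
x = tf.constant([1.0, 2.0, 3.0])
y = tf.constant([2.0, 4.0, 6.0])  # true relation: y = 2x
opt = tf.keras.optimizers.SGD(learning_rate=0.1)

for _ in range(100):
    with tf.GradientTape() as tape:
        loss = huber_like_loss(y, w * x)
    # Autodiff handles the if-else branches of the loss
    opt.apply_gradients([(tape.gradient(loss, w), w)])

print(w.numpy())  # converges toward 2.0
```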