I am trying to use leaky_relu as my activation function for hidden layers. For parameter alpha, it is explained as:
slope of the activation function at x < 0
What does this mean? What effect will different values of alpha have on the results of the model?
A deeper explanation of ReLU and its variants can be found in the following links:
The main drawback of regular ReLU is that the input to the activation can become negative, due to operations performed in the network, which leads to what is referred to as the "dying ReLU" problem:
the gradient is 0 whenever the unit is not active. This could lead to cases where a unit never activates as a gradient-based optimization algorithm will not adjust the weights of a unit that never activates initially. Further, like the vanishing gradients problem, we might expect learning to be slow when training ReLU networks with constant 0 gradients.
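A minimal sketch of that gradient behaviour (plain NumPy, with the gradient formulas written out by hand rather than taken from any particular library):

```python
import numpy as np

def relu_grad(x):
    # ReLU gradient: 1 for x > 0, 0 otherwise
    return (x > 0).astype(float)

def leaky_relu_grad(x, alpha=0.01):
    # Leaky ReLU gradient: 1 for x > 0, alpha otherwise
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(x))        # [0. 0. 1. 1.]       -> negative inputs get no update at all
print(leaky_relu_grad(x))  # [0.01 0.01 1. 1.]   -> negative inputs still learn a little
```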
Leaky ReLU replaces that zero slope on the negative side with some small value, say 0.001, which is the parameter referred to as "alpha". The function becomes f(x) = max(alpha*x, x). For x < 0 the gradient is now alpha instead of 0, so gradient descent keeps adjusting the unit's weights and learning does not hit a dead end. As for the effect of different values: alpha = 0 recovers plain ReLU, small values (0.001–0.3) let a little of the negative signal and gradient leak through, and alpha = 1 would make the activation purely linear.
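If you are using TensorFlow/Keras (an assumption here, since the question does not say which framework), a sketch of setting alpha for hidden layers might look like this; the layer sizes and alpha values are purely illustrative:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(10,)),
    tf.keras.layers.LeakyReLU(alpha=0.01),  # small slope: behaves almost like plain ReLU
    tf.keras.layers.Dense(64),
    tf.keras.layers.LeakyReLU(alpha=0.3),   # larger slope: more of the negative input leaks through
    tf.keras.layers.Dense(1),
])

# The functional form itself, f(x) = max(alpha * x, x):
x = tf.constant([-2.0, -0.5, 0.5, 2.0])
print(tf.nn.leaky_relu(x, alpha=0.01))  # [-0.02, -0.005, 0.5, 2.0]
```

Note that very recent Keras versions may name this parameter negative_slope instead of alpha, so check the documentation for the version you have installed.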