When I set epsilon=10e-8, AdamOptimizer doesn't work. When I set it to 1, it works just fine.
train.AdamOptimizer is compatible with eager mode and tf.function. When eager execution is enabled, learning_rate, beta1, beta2, and epsilon can each be a callable that takes no arguments and returns the actual value to use.
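To make the "callable that takes no arguments" idea concrete, here is a minimal plain-Python sketch. It is not TensorFlow's implementation: resolve, decaying_epsilon, and the step dictionary are hypothetical names invented for this illustration, and the schedule itself is an arbitrary assumption.

```python
def resolve(value):
    """Return value() if value is a zero-argument callable, else value itself."""
    return value() if callable(value) else value


# Hypothetical training-step counter that the schedule reads each time it is called.
step = {"count": 0}


def decaying_epsilon():
    # Illustrative schedule (an assumption): start at 1.0, divide by 10 per step,
    # and never go below 1e-8.
    return max(1e-8, 1.0 / (10 ** step["count"]))


# A plain number and a callable are handled the same way by the caller.
print(resolve(0.1))
print(resolve(decaying_epsilon))
```

Because the callable is evaluated each time the value is needed, changing step["count"] changes the epsilon that the next update would see.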
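Adam's hyperparameters are α (the learning rate), 𝛽1 (from Momentum), 𝛽2 (from RMSProp), and epsilon. In the update rule below they appear as learning_rate, beta1, beta2, and epsilon.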
The learning rate controls how quickly the model is adapted to the problem. Smaller learning rates require more training epochs given the smaller changes made to the weights each update, whereas larger learning rates result in rapid changes and require fewer training epochs.
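As a rough illustration of that relationship, the sketch below runs plain gradient descent (not Adam) on the toy function f(w) = w**2 and counts the steps needed to get close to the minimum. The function steps_to_converge, the starting point, and the tolerance are assumptions chosen for the example.

```python
def steps_to_converge(learning_rate, w=1.0, tol=1e-3, max_steps=100_000):
    """Count gradient-descent steps on f(w) = w**2 until |w| < tol."""
    for step in range(max_steps):
        if abs(w) < tol:
            return step
        w -= learning_rate * 2.0 * w  # the gradient of w**2 is 2*w
    return max_steps


for lr in (0.001, 0.01, 0.1):
    print(lr, steps_to_converge(lr))
```

On this toy problem the step count grows as the learning rate shrinks. The sketch only covers rates small enough to converge; it says nothing about rates that are too large.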
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
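variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
where g is the gradient, m_t and v_t are moving averages of the gradient and of the squared gradient, and t is the step count.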
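The update rule above can be transcribed almost line for line into plain Python. This is a scalar sketch for reading alongside the equations, not the library's code: adam_step and the toy objective f(x) = x**2 are assumptions made for the example, and the default values are the ones AdamOptimizer documents.

```python
import math


def adam_step(variable, g, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one Adam update to a scalar variable; returns the new state."""
    t += 1
    # Step size with bias correction folded in
    lr_t = learning_rate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    m = beta1 * m + (1 - beta1) * g        # moving average of the gradient
    v = beta2 * v + (1 - beta2) * g * g    # moving average of the squared gradient
    variable -= lr_t * m / (math.sqrt(v) + epsilon)
    return variable, m, v, t


# Toy usage: minimize f(x) = x**2, whose gradient is 2*x.
x, m, v, t = 5.0, 0.0, 0.0, 0
for _ in range(2000):
    x, m, v, t = adam_step(x, 2.0 * x, m, v, t, learning_rate=0.05)
print(x)
```

Note that on the very first step the update is close to learning_rate in magnitude whatever the size of the gradient, as long as epsilon is negligible next to sqrt(v_t).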
Epsilon is there to avoid a divide-by-zero error in the update above when the gradient is almost zero, so ideally it should be a small value. But a small epsilon in the denominator leads to larger weight updates, and with subsequent normalization the larger weights will always be normalized to 1.
So I guess that when you train with a small epsilon, the optimizer becomes unstable.
The trade-off is that the bigger you make epsilon (and the denominator), the smaller the weight updates are and thus the slower training progress will be. Most of the time you want the denominator to be able to get small. Usually, an epsilon value greater than 10e-4 performs better.
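A small numeric sketch of this trade-off, under a simplifying assumption: the gradient g is constant, so the moving averages have settled at m = g and v = g * g and the bias correction is about 1. The helper update_size is a hypothetical name for this illustration.

```python
import math


def update_size(g, epsilon, learning_rate=0.001):
    """Adam step size for a constant gradient g, once the averages have settled."""
    m, v = g, g * g  # averages of a constant gradient converge to g and g**2
    return learning_rate * m / (math.sqrt(v) + epsilon)


# Compare a tiny gradient (1e-6) with a large one (1.0) across epsilon values.
for eps in (1e-8, 1e-4, 0.1, 1.0):
    print(eps, update_size(1e-6, eps), update_size(1.0, eps))
```

With a tiny epsilon, the tiny gradient still produces a step of nearly the full learning rate, because m / sqrt(v) is close to 1. With epsilon = 1.0, the step for the tiny gradient shrinks roughly in proportion to the gradient, which is the slower but calmer behavior described above.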
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet, a current good choice is 1.0 or 0.1.