
How does the epsilon hyperparameter affect tf.train.AdamOptimizer?

When I set epsilon=10e-8, AdamOptimizer doesn't work. When I set it to 1, it works just fine.

hhb1994 asked Apr 05 '17 03:04



1 Answer

t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)

where g is the gradient.
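For anyone who wants to step through the update by hand, here is a minimal plain-Python sketch of the rule for a single scalar variable. The function name adam_step and the default hyperparameter values are my own choices for illustration, not something taken from the TensorFlow source.

```python
import math


def adam_step(variable, g, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a scalar variable (hypothetical helper, illustration only)."""
    t = t + 1
    # Bias-corrected step size for this iteration.
    lr_t = learning_rate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    # Running averages of the gradient and of the squared gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # epsilon is added to the denominator so it can never be exactly zero.
    variable = variable - lr_t * m / (math.sqrt(v) + epsilon)
    return variable, m, v, t


# Start from variable = 1.0 with zero moment estimates and take one step.
variable, m, v, t = adam_step(1.0, g=1.0, m=0.0, v=0.0, t=0)
print(variable, m, v, t)
```

The state (m, v, t) is returned and passed back in on the next call, so the sketch stays a pure function.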

Epsilon is there to avoid a divide-by-zero error in the denominator of the variable update when the gradient is almost zero, so ideally it should be a small value. But a small epsilon in the denominator produces larger weight updates, and with subsequent normalization the larger weights will always be normalized to 1.

So my guess is that the optimizer becomes unstable when you train with a small epsilon.

The trade-off is that the bigger you make epsilon (and with it the denominator), the smaller the weight updates are, and thus the slower training progresses. Most of the time you want the denominator to be able to get small. Usually, an epsilon value greater than 10e-4 performs better.
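To get a feel for this trade-off, here is a rough sketch that computes the size of the very first Adam update (t = 1, moment estimates starting at zero) for a few gradient magnitudes and epsilon values. The helper first_step_size and the hyperparameter defaults are assumptions I picked for illustration, and it only looks at a single scalar, so treat it as a toy, not a benchmark.

```python
import math


def first_step_size(g, epsilon, learning_rate=0.001, beta1=0.9, beta2=0.999):
    """Magnitude of the first Adam update for a scalar gradient g (hypothetical helper)."""
    lr_t = learning_rate * math.sqrt(1 - beta2) / (1 - beta1)
    m = (1 - beta1) * g        # m_1, starting from m_0 = 0
    v = (1 - beta2) * g * g    # v_1, starting from v_0 = 0
    return abs(lr_t * m / (math.sqrt(v) + epsilon))


# Compare how the update scales with the gradient for small and large epsilon.
for eps in (1e-8, 1e-4, 1.0):
    sizes = [first_step_size(g, eps) for g in (1e-6, 1e-3, 1.0)]
    print("epsilon=%g" % eps, ["%.2e" % s for s in sizes])
```

With a tiny epsilon the update stays close to the learning rate even when the gradient is tiny, so near-zero (possibly noisy) gradients still move the weights a lot. With epsilon around 1 the update shrinks roughly in proportion to the gradient, which is calmer but slower.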

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet, a current good choice is 1.0 or 0.1.

Nandeesh answered Oct 05 '22 19:10