When I set <code>epsilon=10e-8</code>, <code>AdamOptimizer</code> doesn't work. When I set it to 1, it works just fine.


How does the epsilon hyperparameter affect tf.train.AdamOptimizer?

1 Answer

tf.train.AdamOptimizer applies the following update at each step:

t <- t + 1

lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * g

v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g

variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)

where g is the gradient.
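To make the update rule concrete, here is a minimal pure-Python sketch of one step for a single scalar variable. The function name `adam_step` and the sample inputs are only illustrative, and the default hyperparameters are assumed to match the documented TensorFlow defaults (learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8). It is not TensorFlow's actual implementation.

```python
import math


def adam_step(variable, g, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one Adam update to a scalar; returns (variable, m, v, t)."""
    t = t + 1
    # Step size with the bias correction folded in.
    lr_t = learning_rate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    # Moving averages of the gradient and of the squared gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # epsilon keeps the denominator away from zero.
    variable = variable - lr_t * m / (math.sqrt(v) + epsilon)
    return variable, m, v, t


# Illustrative inputs: start at 1.0 with gradient 0.5 and zero moments.
new_variable, m, v, t = adam_step(1.0, g=0.5, m=0.0, v=0.0, t=0)
print(new_variable)
```

With the default epsilon, the first step moves the variable by roughly learning_rate, because m_t / sqrt(v_t) and the bias correction in lr_t almost cancel the gradient's scale.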

Epsilon is there to avoid a divide-by-zero error in the update above when the gradient is almost zero, so ideally it should be a small value. But a small epsilon in the denominator also allows larger weight updates: the ratio m_t / (sqrt(v_t) + epsilon) is normalized to a magnitude of about 1 even when the gradient itself is tiny, so the step stays close to lr_t no matter how small the gradient is.

So, I guess that when you train with a small epsilon, the optimizer can become unstable.
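One way to see this effect is to look at the very first update (t = 1, with m and v starting at zero) for gradients of very different sizes. The sketch below is only an illustration under those assumptions; `first_step_size` is a hypothetical helper, not part of TensorFlow.

```python
import math


def first_step_size(g, epsilon, learning_rate=0.001, beta1=0.9, beta2=0.999):
    """Magnitude of the first Adam update (t = 1) for a scalar gradient g."""
    lr_t = learning_rate * math.sqrt(1 - beta2) / (1 - beta1)
    m = (1 - beta1) * g        # m_0 = 0
    v = (1 - beta2) * g * g    # v_0 = 0
    return abs(lr_t * m / (math.sqrt(v) + epsilon))


for g in (1.0, 1e-3, 1e-6):
    small = first_step_size(g, epsilon=1e-8)
    large = first_step_size(g, epsilon=1.0)
    print(f"g={g:g}  step(eps=1e-8)={small:.3e}  step(eps=1.0)={large:.3e}")
```

With epsilon=1e-8 the step stays on the order of the learning rate even when the gradient shrinks by several orders of magnitude, so near-zero (possibly noisy) gradients still move the weights by a full step. With epsilon=1.0 the step shrinks together with the gradient.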

The trade-off is that the bigger you make epsilon (and with it the denominator), the smaller the weight updates are and the slower training progresses. Most of the time you still want the denominator to be able to get small. Usually, an epsilon value greater than 10e-4 performs better.
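The slowdown from a larger epsilon can be sketched on a toy problem. The example below minimizes f(x) = x^2 for a fixed number of steps using the same update rule and compares how far x gets for a few epsilon values. The objective, step count, starting point, and the helper `run_adam` are all assumptions chosen for illustration, and a toy like this says nothing about which epsilon is best for a real network.

```python
import math


def run_adam(epsilon, steps=200, x0=1.0, learning_rate=0.001,
             beta1=0.9, beta2=0.999):
    """Run Adam on f(x) = x**2 and return x after `steps` updates."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2.0 * x  # gradient of x**2
        lr_t = learning_rate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        x -= lr_t * m / (math.sqrt(v) + epsilon)
    return x


results = {eps: run_adam(eps) for eps in (1e-8, 1e-3, 0.1, 1.0)}
for eps, x in results.items():
    print(f"epsilon={eps:g}  x after 200 steps = {x:.4f}")
```

The minimum is at x = 0, so a final x closer to the starting value of 1.0 means slower progress. In this setup the runs with larger epsilon end up further from the minimum after the same number of steps.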

The TensorFlow documentation for AdamOptimizer says as much: "The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1."

answered Oct 05 '22 19:10

Nandeesh


Tags:

machine-learning

neural-network

deep-learning

epsilon

hhb1994
