I'm studying TensorFlow and how to use it, even though I'm not an expert in neural networks and deep learning (I just know the basics).
Following the tutorials, I don't understand the real, practical differences between the three optimizers used to minimize the loss. I've looked at the API and I understand the principles, but my questions are:
1. When is it preferable to use one instead of the others?
2. Are there important differences to know?
Adagrad (Adaptive Gradient Algorithm): in the Adagrad optimizer there is no momentum concept, so it is much simpler than SGD with momentum. The idea behind Adagrad is to use a different learning rate for each parameter, based on the iteration.
The Momentum method uses the first moment with a decay rate to gain speed. AdaGrad uses the second moment with no decay to deal with sparse features. RMSProp uses the second moment with a decay rate to speed up from AdaGrad. Adam uses both the first and second moments, and is generally the best choice.
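If it helps to connect this to TensorFlow, here is a minimal sketch of how these optimizers are selected in practice. It assumes TensorFlow 2.x and the tf.keras API, and the hyperparameter values shown are just the usual defaults, not recommendations:

```python
import tensorflow as tf

sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)   # first moment only
nesterov = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9,
                                   nesterov=True)                          # NAG variant
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)                  # second moment, no decay
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)        # second moment, with decay
adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                beta_2=0.999)                              # both moments

# They are drop-in replacements for one another, e.g.:
# model.compile(optimizer=adam, loss="mse")
```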
Momentum is an extension to the gradient descent optimization algorithm, often referred to as gradient descent with momentum.
Here is a brief explanation based on my understanding:
Adam, or Adaptive Moment Estimation, is an algorithm similar to AdaDelta, but in addition to storing learning rates for each of the parameters, it also stores momentum changes for each of them separately.
A few visualizations:
I would say that SGD, Momentum and Nesterov are inferior to the last 3.
Salvador Dali's answer already explains the differences between some popular methods (i.e. optimizers), but I will try to elaborate on them some more.
(Note that our answers disagree about some points, especially regarding ADAGRAD.)
(Mostly based on section 2 in the paper On the importance of initialization and momentum in deep learning.)
Each step in both CM and NAG is actually composed of two sub-steps:
- A momentum sub-step: simply a fraction (typically in the range [0.9,1)) of the last step.
- A gradient sub-step: like a regular SGD step, it is based on the gradient and the learning rate.

CM takes the gradient sub-step first, while NAG takes the momentum sub-step first.
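For concreteness, here is a rough Python sketch of one CM step versus one NAG step; the names (theta, velocity, grad_fn) and the default values are mine, chosen only for illustration:

```python
def cm_step(theta, velocity, grad_fn, lr=0.01, decay=0.9):
    """Classical momentum (CM): the gradient sub-step is computed at the
    current position theta and combined with the momentum sub-step."""
    velocity = decay * velocity - lr * grad_fn(theta)
    return theta + velocity, velocity


def nag_step(theta, velocity, grad_fn, lr=0.01, decay=0.9):
    """Nesterov (NAG): take the momentum sub-step first, then compute the
    gradient sub-step at the resulting look-ahead position."""
    lookahead = theta + decay * velocity                     # momentum sub-step
    velocity = decay * velocity - lr * grad_fn(lookahead)    # gradient at look-ahead
    return theta + velocity, velocity


# e.g. on f(x) = x**2, whose gradient is 2*x:
theta, velocity = 5.0, 0.0
for _ in range(3):
    theta, velocity = nag_step(theta, velocity, grad_fn=lambda x: 2 * x)
```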
Here is a demonstration from an answer about intuition for CM and NAG:
So NAG seems to be better (at least in the image), but why?
The important thing to note is that it doesn't matter when the momentum sub-step comes - it would be the same either way. Therefore, we might as well behave as if the momentum sub-step has already been taken.
Thus, the question actually is: Assuming that the gradient sub-step is taken after the momentum sub-step, should we calculate the gradient sub-step as if it started in the position before or after taking the momentum sub-step?
"After it" seems like the right answer, as generally, the gradient at some point θ
roughly points you in the direction from θ
to a minimum (with the relatively right magnitude), while the gradient at some other point is less likely to point you in the direction from θ
to a minimum (with the relatively right magnitude).
Here is a demonstration (from the gif below):
Note that this argument for why NAG is better is independent of whether the algorithm is close to a minimum.
In general, both NAG and CM often have the problem of accumulating more momentum than is good for them, so whenever they should change direction, they have an embarrassing "response time". The advantage of NAG over CM that we explained doesn't prevent the problem, but only makes the "response time" of NAG less embarrassing (but embarrassing still).
This "response time" problem is beautifully demonstrated in the gif by Alec Radford (which appeared in Salvador Dali's answer):
(Mostly based on section 2.2.2 in ADADELTA: An Adaptive Learning Rate Method (the original ADADELTA paper), as I find it much more accessible than Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (the original ADAGRAD paper).)
In SGD, the step is given by - learning_rate * gradient, while learning_rate is a hyperparameter.
ADAGRAD also has a learning_rate hyperparameter, but the actual learning rate for each component of the gradient is calculated individually.
The i-th component of the t-th step is given by:

                  learning_rate
    - --------------------------------------- * gradient_i_t
      norm((gradient_i_1, ..., gradient_i_t))
while:
- gradient_i_k is the i-th component of the gradient in the k-th step.
- (gradient_i_1, ..., gradient_i_t) is a vector with t components. It isn't intuitive (at least to me) that constructing such a vector makes sense, but that's what the algorithm does (conceptually).
- norm(vector) is the Euclidean norm (aka l2 norm) of vector, which is our intuitive notion of the length of vector.
- The expression that is multiplied by gradient_i_t (in this case, learning_rate / norm(...)) is often called "the learning rate" (in fact, I called it "the actual learning rate" in the previous paragraph). I guess this is because in SGD the learning_rate hyperparameter and this expression are one and the same.

E.g. if:
- the i-th component of the gradient in the first step is 1.15
- the i-th component of the gradient in the second step is 1.35
- the i-th component of the gradient in the third step is 0.9
Then the norm of (1.15, 1.35, 0.9) is sqrt(1.15^2 + 1.35^2 + 0.9^2) = 1.989.
And so the i-th component of the third step is: - learning_rate / 1.989 * 0.9
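If you want to check that arithmetic, here is a tiny NumPy sketch of the same per-component ADAGRAD step; the learning_rate value is an arbitrary example:

```python
import numpy as np

learning_rate = 0.01                    # arbitrary example value
grads_i = np.array([1.15, 1.35, 0.9])   # the i-th gradient component in steps 1..3

norm = np.linalg.norm(grads_i)          # sqrt(1.15**2 + 1.35**2 + 0.9**2), approx. 1.989
step_i_3 = -learning_rate / norm * grads_i[-1]
print(norm, step_i_3)                   # approx. 1.989 and -0.00453
```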
Note two things about the i-th component of the step:
- It is proportional to learning_rate. This means that ADAGRAD is sensitive to the choice of the hyperparameter learning_rate.
- The norm in its denominator can only grow as more gradients accumulate, so the step can only get smaller over time.
In addition, it might be that after some time the steps become so small, that ADAGRAD virtually gets stuck.
From the ADADELTA paper:
The idea presented in this paper was derived from ADAGRAD in order to improve upon the two main drawbacks of the method: 1) the continual decay of learning rates throughout training, and 2) the need for a manually selected global learning rate.
The paper then explains an improvement that is meant to tackle the first drawback:
Instead of accumulating the sum of squared gradients over all time, we restricted the window of past gradients that are accumulated to be some fixed size w [...]. This ensures that learning continues to make progress even after many iterations of updates have been done.
Since storing w previous squared gradients is inefficient, our methods implements this accumulation as an exponentially decaying average of the squared gradients.
By "exponentially decaying average of the squared gradients" the paper means that for each i
we compute a weighted average of all of the squared i
-th components of all of the gradients that were calculated.
The weight of each squared i
-th component is bigger than the weight of the squared i
-th component in the previous step.
This is an approximation of a window of size w
because the weights in earlier steps are very small.
(When I think about an exponentially decaying average, I like to visualize a comet's trail, which becomes dimmer and dimmer as it gets further from the comet.)
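In code, such an exponentially decaying average is just one running number per component. Here is a minimal sketch; the decay rate 0.9 and the sample gradient values are arbitrary examples:

```python
decay = 0.9
avg_sq = 0.0

for g in [1.15, 1.35, 0.9]:  # the i-th gradient component over successive steps
    avg_sq = decay * avg_sq + (1 - decay) * g ** 2
    # every older squared gradient keeps being multiplied by `decay`,
    # so its weight fades like the comet's trail described above
print(avg_sq)
```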
If you make only this change to ADAGRAD, then you will get RMSProp, which is a method that was proposed by Geoff Hinton in Lecture 6e of his Coursera Class.
So in RMSProp, the i-th component of the t-th step is given by:

                    learning_rate
    - ------------------------------------------------- * gradient_i_t
      sqrt(exp_decay_avg_of_squared_grads_i + epsilon)
while:
- epsilon is a hyperparameter that prevents a division by zero.
- exp_decay_avg_of_squared_grads_i is an exponentially decaying average of the squared i-th components of all of the gradients calculated (including gradient_i_t).

But as aforementioned, ADADELTA also aims to get rid of the learning_rate hyperparameter, so there must be more stuff going on in it.
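Before getting to ADADELTA's extra ingredient, here is a minimal Python sketch of the RMSProp-style per-component update described above; the function name and the default decay/epsilon values are my own illustrative choices:

```python
def rmsprop_component_step(grad_i, avg_sq_grad_i,
                           learning_rate=0.001, decay=0.9, epsilon=1e-8):
    """One RMSProp-style update of the i-th component. Returns the step to add
    to the parameter and the updated decaying average of squared gradients."""
    avg_sq_grad_i = decay * avg_sq_grad_i + (1 - decay) * grad_i ** 2
    step_i = -learning_rate / (avg_sq_grad_i + epsilon) ** 0.5 * grad_i
    return step_i, avg_sq_grad_i
```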
In ADADELTA, the i-th component of the t-th step is given by:

      sqrt(exp_decay_avg_of_squared_steps_i + epsilon)
    - ------------------------------------------------- * gradient_i_t
      sqrt(exp_decay_avg_of_squared_grads_i + epsilon)
while exp_decay_avg_of_squared_steps_i is an exponentially decaying average of the squared i-th components of all of the steps calculated (until the (t-1)-th step).
sqrt(exp_decay_avg_of_squared_steps_i + epsilon) is somewhat similar to momentum, and according to the paper, it "acts as an acceleration term". (The paper also gives another reason for why it was added, but my answer is already too long, so if you are curious, check out section 3.2.)
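Putting the two decaying averages together, here is a sketch of one ADADELTA update for a single component. The function and variable names are mine; the decay and epsilon defaults are in the spirit of the values used in the paper, not tuned recommendations:

```python
def adadelta_component_step(grad_i, avg_sq_grad_i, avg_sq_step_i,
                            decay=0.95, epsilon=1e-6):
    """One ADADELTA update of the i-th component: the learning_rate
    hyperparameter is replaced by a decaying average of past squared steps."""
    avg_sq_grad_i = decay * avg_sq_grad_i + (1 - decay) * grad_i ** 2
    step_i = -((avg_sq_step_i + epsilon) ** 0.5 /
               (avg_sq_grad_i + epsilon) ** 0.5) * grad_i
    # updated only after being used, so the numerator above reflects
    # the squared steps up to the previous one
    avg_sq_step_i = decay * avg_sq_step_i + (1 - decay) * step_i ** 2
    return step_i, avg_sq_grad_i, avg_sq_step_i
```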
(Mostly based on Adam: A Method for Stochastic Optimization, the original Adam paper.)
Adam is short for Adaptive Moment Estimation (see this answer for an explanation about the name).
The i-th component of the t-th step is given by:

                     learning_rate
    - -------------------------------------------------- * exp_decay_avg_of_grads_i
      sqrt(exp_decay_avg_of_squared_grads_i) + epsilon
while:
- exp_decay_avg_of_grads_i is an exponentially decaying average of the i-th components of all of the gradients calculated (including gradient_i_t).
- exp_decay_avg_of_grads_i and exp_decay_avg_of_squared_grads_i are also corrected to account for a bias toward 0 (for more about that, see section 3 in the paper, and also an answer in stats.stackexchange).

Note that Adam uses an exponentially decaying average of the i-th components of the gradients where most SGD methods use the i-th component of the current gradient. This causes Adam to behave like "a heavy ball with friction", as explained in the paper GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.
See this answer for more about how Adam's momentum-like behavior is different from the usual momentum-like behavior.
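Finally, here is a sketch of one Adam update for a single component, including the bias correction mentioned above; the function name is mine, and the hyperparameter defaults are the ones suggested in the Adam paper:

```python
def adam_component_step(grad_i, m_i, v_i, t, learning_rate=0.001,
                        beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update of the i-th component; t is the 1-based step number."""
    m_i = beta1 * m_i + (1 - beta1) * grad_i         # decaying avg of gradients
    v_i = beta2 * v_i + (1 - beta2) * grad_i ** 2    # decaying avg of squared gradients
    m_hat = m_i / (1 - beta1 ** t)                   # correct the bias toward 0
    v_hat = v_i / (1 - beta2 ** t)
    step_i = -learning_rate * m_hat / (v_hat ** 0.5 + epsilon)
    return step_i, m_i, v_i
```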