I'm confused about how the Adam optimizer actually works in TensorFlow.
The way I read the docs, the learning rate is changed on every gradient descent iteration.
But when I call the function I give it a learning rate, and I don't call it once per epoch (implicitly running however many iterations it takes to go through my training data). I call it for each batch explicitly, like
for epoch in epochs:
    for batch in data:
        sess.run(train_adam_step, feed_dict={eta: 1e-3})
So my eta cannot be changing, and I'm not passing a time variable in. Or is this some sort of generator-type thing where, upon session creation, t is incremented each time I call the optimizer?
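For concreteness, here is a minimal self-contained version of my setup (the toy variable, loss, and loop bounds are just placeholders, assuming the TensorFlow 1.x graph/session API):

import tensorflow as tf

# Toy variable and loss, just so Adam has something to optimize.
w = tf.Variable(0.0)
loss = tf.square(w - 3.0)

eta = tf.placeholder(tf.float32, shape=[])                  # learning rate fed per call
train_adam_step = tf.train.AdamOptimizer(eta).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(2):                                  # "for epoch in epochs"
        for batch in range(100):                            # "for batch in data"
            sess.run(train_adam_step, feed_dict={eta: 1e-3})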
Assuming it is some generator-type thing and the learning rate is being invisibly reduced: how could I run the Adam optimizer without decaying the learning rate? It seems to me that RMSProp is basically the same; the only thing I'd have to do to make it equal (learning rate disregarded) is to change the hyperparameters momentum and decay to match beta1 and beta2 respectively. Is that correct?
Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments.
The Adam optimizer is an adaptive learning rate optimizer that is very popular for deep learning, especially in computer vision. I have seen papers where, after a specific number of epochs (for example, 50), the learning rate is decreased by dividing it by 10.
For RMSProp, Geoff Hinton recommends setting γ to 0.9, while a default value for the learning rate η is 0.001. This allows the learning rate to adapt over time, which is important to understand since this phenomenon is also present in Adam.
In Adam, instead of adapting the learning rates based only on the average of the squared gradients (the second moment) as in RMSProp, the average of the gradients themselves (the first moment) is used as well. The algorithm calculates exponential moving averages of both the gradients and the squared gradients.
I find the documentation quite clear, so I will paste the algorithm here in pseudo-code.
Your parameters:
learning_rate: between 1e-4 and 1e-2 is standard
beta1: 0.9 by default
beta2: 0.999 by default
epsilon: 1e-08 by default
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet, a current good choice is 1.0 or 0.1.
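For reference, a minimal sketch of how these parameters map onto the TF1 constructor (the values shown are simply the defaults listed above; raising epsilon would follow the Inception note):

import tensorflow as tf

# Defaults from the list above; epsilon could be raised to 1.0 or 0.1 for
# large models, per the note on Inception/ImageNet.
optimizer = tf.train.AdamOptimizer(learning_rate=0.001,
                                   beta1=0.9,
                                   beta2=0.999,
                                   epsilon=1e-08)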
Initialization:
m_0 <- 0 (Initialize initial 1st moment vector)
v_0 <- 0 (Initialize initial 2nd moment vector)
t <- 0 (Initialize timestep)
m_t and v_t will keep track of a moving average of the gradient and its square, for each parameter of the network. (So if you have 1M parameters, Adam will keep 2M more parameters in memory.)
At each iteration t, and for each parameter of the model:
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * gradient
v_t <- beta2 * v_{t-1} + (1 - beta2) * gradient ** 2
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
Here lr_t is a bit different from learning_rate, because for early iterations the moving averages have not converged yet, so we have to normalize by multiplying by sqrt(1 - beta2^t) / (1 - beta1^t). When t is high (t > 1./(1.-beta2)), lr_t is almost equal to learning_rate.
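To make the update concrete, here is a small NumPy sketch of the same step for a single scalar parameter (the toy objective (x - 3)^2 and the loop length are my own illustration, not from the docs):

import numpy as np

def adam_step(variable, gradient, m, v, t,
              learning_rate=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # One Adam update, following the pseudo-code above.
    t += 1
    lr_t = learning_rate * np.sqrt(1 - beta2**t) / (1 - beta1**t)
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient**2
    variable = variable - lr_t * m / (np.sqrt(v) + epsilon)
    return variable, m, v, t

# Toy example: minimize (x - 3)^2, whose gradient is 2 * (x - 3).
x, m, v, t = 0.0, 0.0, 0.0, 0
for _ in range(10000):
    x, m, v, t = adam_step(x, 2 * (x - 3), m, v, t)
print(x)  # close to 3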
To answer your question, you just need to pass a fixed learning rate, keep beta1 and beta2 at their default values, maybe modify epsilon, and Adam will do the magic :)
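In practice that means the feed_dict trick from your question is not needed; a rough sketch (toy loss again, TF1 API assumed):

import tensorflow as tf

# Fixed learning rate, default beta1/beta2; Adam tracks the timestep t and the
# moving averages m_t, v_t internally as extra variables.
w = tf.Variable(0.0)
loss = tf.square(w - 3.0)
train_step = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        sess.run(train_step)   # no feed_dict needed; the base learning rate stays fixed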
Adam with beta1=0 is essentially RMSProp with momentum=0 (up to Adam's bias-correction factor and where epsilon sits). The argument beta2 of Adam and the argument decay of RMSProp play the same role.
However, RMSProp does not keep a moving average of the gradient, but it can maintain a momentum term, like MomentumOptimizer.
Here is the pseudo-code:
v_t <- decay * v_{t-1} + (1-decay) * gradient ** 2
mom_t <- momentum * mom_{t-1} + learning_rate * gradient / sqrt(v_t + epsilon)
variable <- variable - mom_t
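And a matching NumPy sketch of this RMSProp update (the defaults mirror tf.train.RMSPropOptimizer: decay=0.9, momentum=0.0, epsilon=1e-10; the toy objective is mine):

import numpy as np

def rmsprop_step(variable, gradient, v, mom,
                 learning_rate=1e-3, decay=0.9, momentum=0.0, epsilon=1e-10):
    # One RMSProp-with-momentum update, following the pseudo-code above.
    v = decay * v + (1 - decay) * gradient**2
    mom = momentum * mom + learning_rate * gradient / np.sqrt(v + epsilon)
    variable = variable - mom
    return variable, v, mom

# Same toy problem as before: minimize (x - 3)^2.
x, v, mom = 0.0, 0.0, 0.0
for _ in range(10000):
    x, v, mom = rmsprop_step(x, 2 * (x - 3), v, mom)
print(x)  # close to 3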