I'm confused about how the Adam optimizer actually works in TensorFlow.
The way I read the docs, the learning rate is changed on every gradient descent iteration.
But when I call the function I give it a learning rate, and I don't call it once per epoch (implicitly running however many iterations it takes to go through my training data). I call it for each batch explicitly, like
for epoch in epochs:
    for batch in data:
        sess.run(train_adam_step, feed_dict={eta: 1e-3})
So my eta cannot be changing, and I'm not passing a time variable in. Or is this some sort of generator-type thing where, upon session creation, t is incremented each time I call the optimizer?
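For concreteness, here is a minimal self-contained version of my setup (the toy variable, loss, and loop bounds are just placeholders, assuming the TensorFlow 1.x graph/session API):

import tensorflow as tf

# Toy variable and loss, just so Adam has something to optimize.
w = tf.Variable(0.0)
loss = tf.square(w - 3.0)

eta = tf.placeholder(tf.float32, shape=[])                  # learning rate fed per call
train_adam_step = tf.train.AdamOptimizer(eta).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(2):                                  # "for epoch in epochs"
        for batch in range(100):                            # "for batch in data"
            sess.run(train_adam_step, feed_dict={eta: 1e-3})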
Assuming it is some generator-type thing and the learning rate is being invisibly reduced: how could I run the Adam optimizer without decaying the learning rate? It seems to me that RMSProp is basically the same; the only thing I'd have to do to make it equal (learning rate disregarded) is to change the hyperparameters momentum and decay to match beta1 and beta2 respectively. Is that correct?
Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments.
The Adam optimizer is an adaptive learning rate optimizer that is very popular for deep learning, especially in computer vision. I have seen papers where, after a specific number of epochs (for example, 50), the learning rate is decreased by dividing it by 10.
For RMSProp, Geoff Hinton recommends setting γ to 0.9, while a default value for the learning rate η is 0.001. This allows the learning rate to adapt over time, which is important to understand since this phenomenon is also present in Adam.
In Adam, instead of adapting the learning rates based only on the average of the squared gradients (the second moment) as in RMSProp, the average of the gradients themselves (the first moment) is used as well. The algorithm calculates exponential moving averages of both the gradients and the squared gradients.
I find the documentation quite clear, so I will paste the algorithm here in pseudo-code.
Your parameters:
learning_rate: between 1e-4 and 1e-2 is standard
beta1: 0.9 by default
beta2: 0.999 by default
epsilon: 1e-08 by default
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet, a current good choice is 1.0 or 0.1.
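For reference, a minimal sketch of how these parameters map onto the TF1 constructor (the values shown are simply the defaults listed above; raising epsilon would follow the Inception note):

import tensorflow as tf

# Defaults from the list above; epsilon could be raised to 1.0 or 0.1 for
# large models, per the note on Inception/ImageNet.
optimizer = tf.train.AdamOptimizer(learning_rate=0.001,
                                   beta1=0.9,
                                   beta2=0.999,
                                   epsilon=1e-08)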
Initialization:
m_0 <- 0 (Initialize initial 1st moment vector)
v_0 <- 0 (Initialize initial 2nd moment vector)
t <- 0 (Initialize timestep)
m_t and v_t will keep track of a moving average of the gradient and its square, for each parameter of the network. (So if you have 1M parameters, Adam will keep 2M more parameters in memory.)
At each iteration t, and for each parameter of the model:
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * gradient
v_t <- beta2 * v_{t-1} + (1 - beta2) * gradient ** 2
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
Here lr_t is a bit different from learning_rate, because for early iterations the moving averages have not converged yet, so we have to normalize by multiplying by sqrt(1 - beta2^t) / (1 - beta1^t). When t is high (t > 1./(1.-beta2)), lr_t is almost equal to learning_rate.
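To make the update concrete, here is a small NumPy sketch of the same step for a single scalar parameter (the toy objective (x - 3)^2 and the loop length are my own illustration, not from the docs):

import numpy as np

def adam_step(variable, gradient, m, v, t,
              learning_rate=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # One Adam update, following the pseudo-code above.
    t += 1
    lr_t = learning_rate * np.sqrt(1 - beta2**t) / (1 - beta1**t)
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient**2
    variable = variable - lr_t * m / (np.sqrt(v) + epsilon)
    return variable, m, v, t

# Toy example: minimize (x - 3)^2, whose gradient is 2 * (x - 3).
x, m, v, t = 0.0, 0.0, 0.0, 0
for _ in range(10000):
    x, m, v, t = adam_step(x, 2 * (x - 3), m, v, t)
print(x)  # close to 3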
To answer your question, you just need to pass a fixed learning rate, keep beta1 and beta2 at their default values, maybe modify epsilon, and Adam will do the magic :)
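In practice that means the feed_dict trick from your question is not needed; a rough sketch (toy loss again, TF1 API assumed):

import tensorflow as tf

# Fixed learning rate, default beta1/beta2; Adam tracks the timestep t and the
# moving averages m_t, v_t internally as extra variables.
w = tf.Variable(0.0)
loss = tf.square(w - 3.0)
train_step = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        sess.run(train_step)   # no feed_dict needed; the base learning rate stays fixed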
Adam with beta1=0 is essentially RMSProp with momentum=0 (up to Adam's bias-correction factor and where epsilon sits). The argument beta2 of Adam and the argument decay of RMSProp play the same role.
However, RMSProp does not keep a moving average of the gradient, but it can maintain a momentum term, like MomentumOptimizer.
Here is the pseudo-code:
v_t <- decay * v_{t-1} + (1-decay) * gradient ** 2
mom_t <- momentum * mom_{t-1} + learning_rate * gradient / sqrt(v_t + epsilon)
variable <- variable - mom_t
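And a matching NumPy sketch of this RMSProp update (the defaults mirror tf.train.RMSPropOptimizer: decay=0.9, momentum=0.0, epsilon=1e-10; the toy objective is mine):

import numpy as np

def rmsprop_step(variable, gradient, v, mom,
                 learning_rate=1e-3, decay=0.9, momentum=0.0, epsilon=1e-10):
    # One RMSProp-with-momentum update, following the pseudo-code above.
    v = decay * v + (1 - decay) * gradient**2
    mom = momentum * mom + learning_rate * gradient / np.sqrt(v + epsilon)
    variable = variable - mom
    return variable, v, mom

# Same toy problem as before: minimize (x - 3)^2.
x, v, mom = 0.0, 0.0, 0.0
for _ in range(10000):
    x, v, mom = rmsprop_step(x, 2 * (x - 3), v, mom)
print(x)  # close to 3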