
Learning rate doesn't change for AdamOptimizer in TensorFlow

I would like to see how the learning rate changes during training (print it out or create a summary and visualize it in tensorboard).

Here is a code snippet from what I have so far:

optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

sess.run(tf.initialize_all_variables())

for i in range(0, 10000):
    sess.run(train_op)
    print(sess.run(optimizer._lr_t))

If I run the code I constantly get the initial learning rate (1e-3) i.e. I see no change.

What is a correct way for getting the learning rate at every step?

I would like to add that this question is really similar to mine. However, I cannot post my findings in the comment section there since I do not have enough rep.

Asked Aug 10 '16 by Filip

1 Answer

I was asking myself the exact same question, wondering why it wouldn't change. Looking at the original paper (page 2), one sees that the self._lr stepsize (denoted alpha in the paper) is required by the algorithm but never updated. The paper also defines an alpha_t that is updated at every step t, and it should correspond to the self._lr_t attribute. But in fact, as you observe, evaluating the self._lr_t tensor at any point during training always returns the initial value, that is, _lr.

So your question, as I understand it, is how to get the alpha_t for TensorFlow's AdamOptimizer as described in section 2 of the paper and on the corresponding TF v1.2 API page:

alpha_t = alpha * sqrt(1 - beta2^t) / (1 - beta1^t)
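To see concretely that alpha_t changes at every step even though alpha itself is fixed, here is a plain-Python sketch of the formula above (using Adam's default beta1 = 0.9 and beta2 = 0.999, which is an assumption; your settings may differ):

```python
# Sketch of alpha_t from section 2 of the paper, in plain Python.
# beta1/beta2 are Adam's defaults here (an assumption, not read from TF).
alpha, beta1, beta2 = 1e-3, 0.9, 0.999

def alpha_t(t):
    """Effective step size at training step t (t starts at 1)."""
    return alpha * (1 - beta2 ** t) ** 0.5 / (1 - beta1 ** t)

for t in (1, 10, 100, 1000):
    print(t, alpha_t(t))
```

Note that alpha_t starts below alpha and converges to it as both power terms vanish, which is exactly the bias-correction behaviour described in the paper.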

BACKGROUND

As you observed, the _lr_t tensor doesn't change throughout training, which may lead to the false conclusion that the optimizer doesn't adapt (this can be easily tested by switching to the vanilla GradientDescentOptimizer with the same alpha). And, in fact, other values do change: a quick look at the optimizer's __dict__ shows the following keys: ['_epsilon_t', '_lr', '_beta1_t', '_lr_t', '_beta1', '_beta1_power', '_beta2', '_updated_lr', '_name', '_use_locking', '_beta2_t', '_beta2_power', '_epsilon', '_slots'].

By inspecting them through training, I noticed that only _beta1_power, _beta2_power and the _slots get updated.

Inspecting the optimizer's code further, at line 211, we see the following update:

update_beta1 = self._beta1_power.assign(
        self._beta1_power * self._beta1_t,
        use_locking=self._use_locking)

This basically means that _beta1_power, which is initialized with _beta1, gets multiplied by _beta1_t after every iteration (and _beta2_power is updated analogously with _beta2_t).
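That running product is equivalent to raising beta1 to the power of the current step. A minimal plain-Python sketch of what the assign op above accumulates (variable names are illustrative, not the TF internals):

```python
beta1, beta2 = 0.9, 0.999  # Adam's defaults (an assumption)

# _beta1_power starts at beta1 and is multiplied by beta1 once per step,
# so after t steps it holds beta1 ** t; likewise for _beta2_power.
beta1_power, beta2_power = beta1, beta2  # values after step t = 1
for _ in range(99):  # simulate steps 2..100
    beta1_power *= beta1
    beta2_power *= beta2

print(beta1_power, beta1 ** 100)  # the running product matches the power
```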

But here comes the confusing part: _beta1_t and _beta2_t never get updated, so they effectively hold the initial values (_beta1 and _beta2) through the whole training, contradicting the notation of the paper in a similar fashion as _lr and _lr_t do. I guess this is for a reason, but I personally don't know why; in any case these are protected/private attributes of the implementation (they start with an underscore) and don't belong to the public interface (they may even change between TF versions).

So after this small background we can see that _beta1_power and _beta2_power are the original beta values raised to the power of the current training step, that is, the equivalent of the variables referred to as beta1^t and beta2^t in the paper. Going back to the definition of alpha_t in section 2 of the paper, with this information it should be pretty straightforward to implement:

SOLUTION

optimizer = tf.train.AdamOptimizer()
# rest of the graph...

# ... somewhere in your session
# note that a0 comes from a scalar, whereas bb1 and bb2 come from tensors
# and thus have to be evaluated
a0 = optimizer._lr
bb1 = optimizer._beta1_power.eval()
bb2 = optimizer._beta2_power.eval()
at = a0 * (1 - bb2) ** 0.5 / (1 - bb1)
print(at)

The variable at holds the alpha_t for the current training step.
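If you need this at several places, it may be worth wrapping the computation in a small helper. This is a hypothetical function of my own, not part of the optimizer's interface:

```python
def adam_alpha_t(lr, beta1_power, beta2_power):
    """alpha_t from section 2 of the paper, given the fixed stepsize lr and
    the already-evaluated _beta1_power / _beta2_power values."""
    return lr * (1 - beta2_power) ** 0.5 / (1 - beta1_power)

# e.g. inside a session (assuming the private attributes discussed above):
# at = adam_alpha_t(optimizer._lr,
#                   optimizer._beta1_power.eval(),
#                   optimizer._beta2_power.eval())
print(adam_alpha_t(1e-3, 0.9, 0.999))  # alpha_t after the first step
```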

DISCLAIMER

I couldn't find a cleaner way of getting this value through the optimizer's interface, but please let me know if one exists! I guess there is none, which actually calls into question the usefulness of plotting alpha_t, since it does not depend on the data.

Also, to complete this information, section 2 of the paper also gives the formula for the weight updates, which is much more telling but also more involved to plot. For a very nice and good-looking implementation of that, you may want to take a look at this nice answer from the post that you linked.
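For reference, that full update from section 2 can be sketched in a few lines of plain Python for a single scalar parameter (default hyperparameters assumed; the quadratic toy objective is mine, not from the post):

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter, as in section 2 of the paper."""
    m = b1 * m + (1 - b1) * g         # biased first-moment estimate
    v = b2 * v + (1 - b2) * g * g     # biased second-moment estimate
    m_hat = m / (1 - b1 ** t)         # bias correction ...
    v_hat = v / (1 - b2 ** t)         # ... (this is where the beta^t appear)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# toy run: minimize f(theta) = theta**2, whose gradient is 2 * theta
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 8001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # ends up near the minimum at 0
```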

Hope it helps! Cheers,
Andres

Answered Sep 24 '22 by fr_andres