
How to set parameters of the Adadelta Algorithm in Tensorflow correctly?

I've been using TensorFlow for regression. My neural net is very small: 10 input neurons, a single hidden layer of 12 neurons, and 5 output neurons.

  • The activation function is ReLU.
  • The cost is the squared distance between the output and the real value.
  • The net trains correctly with other optimizers such as GradientDescent, Adam, and Adagrad.

However, when I try to use Adadelta, the neural net simply won't train: the variables stay the same at every step.

I have tried every initial learning_rate from 1.0e-6 to 10 and different weight initializations: it always does the same thing.

Does anyone have an idea of what is going on?

Thanks so much

Asked Jul 28 '16 by Amaury Mercier

1 Answer

Short answer: don't use Adadelta

Very few people use it today; you should instead stick to one of the following:

  • tf.train.MomentumOptimizer with momentum 0.9 is very standard and works well. The drawback is that you have to find the best learning rate yourself.
  • tf.train.RMSPropOptimizer: the results are less dependent on a good learning rate. This algorithm is very similar to Adadelta, but performs better in my opinion. (A minimal construction sketch for both follows this list.)
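
If it helps, here is a minimal sketch of how both would be constructed with the TF 1.x API. The learning rates are illustrative placeholders, not tuned values:

import tensorflow as tf

# Momentum: a standard choice, but you must pick the learning rate yourself
# (0.01 is only an illustrative starting point, not a tuned value).
momentum_opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)

# RMSProp: less sensitive to the exact learning rate.
rmsprop_opt = tf.train.RMSPropOptimizer(learning_rate=0.001)

# Either one is used the same way:
# train_op = momentum_opt.minimize(loss)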

If you really want to use Adadelta, use the parameters from the paper: learning_rate=1., rho=0.95, epsilon=1e-6. A bigger epsilon will help at the start, but be prepared to wait a bit longer than with other optimizers to see convergence.

Note that in the paper they don't even use a learning rate, which is equivalent to keeping it equal to 1.


Long answer

Adadelta has a very slow start. The full algorithm from the paper is:

[Image: Algorithm 1 (ADADELTA) from the paper]
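
In case the image doesn't render, here is my transcription of the update rule from the paper (Zeiler 2012), where RMS[z]_t = sqrt(E[z^2]_t + epsilon):

\begin{align*}
E[g^2]_t        &= \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2 \\
\Delta x_t      &= -\frac{\mathrm{RMS}[\Delta x]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \\
E[\Delta x^2]_t &= \rho\, E[\Delta x^2]_{t-1} + (1-\rho)\, \Delta x_t^2 \\
x_{t+1}         &= x_t + \Delta x_t
\end{align*}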

The issue is that they accumulate the square of the updates.

  • At step 0, the running average of these updates is zero, so the first update will be very small.
  • Because the first update is very small, the running average of the updates grows very slowly, which creates a kind of vicious circle at the beginning (see the sketch below).
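
To make this concrete, here is a pure-Python sketch of the first step on the same toy problem as the code below (loss = v * v at v = 10). It follows the update rule transcribed above; TensorFlow's exact epsilon placement may differ slightly, so the digits won't match its output exactly:

rho, eps = 0.95, 1e-6
accum = accum_update = 0.0  # running averages E[g^2] and E[dx^2]

v = 10.0
g = 2 * v  # gradient of v * v

accum = rho * accum + (1 - rho) * g ** 2                         # E[g^2] = 20.0
update = ((accum_update + eps) ** 0.5 / (accum + eps) ** 0.5) * g
accum_update = rho * accum_update + (1 - rho) * update ** 2

print(update)        # ~0.0045: tiny, because accum_update started at zero
print(accum_update)  # ~1e-6: so the next update will be tiny too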

I think Adadelta performs better on bigger networks than yours, and after some iterations it should match the performance of RMSProp or Adam.


Here is my code to play a bit with the Adadelta optimizer:

import tensorflow as tf

v = tf.Variable(10.)
loss = v * v

optimizer = tf.train.AdadeltaOptimizer(1., 0.95, 1e-6)
train_op = optimizer.minimize(loss)

accum = optimizer.get_slot(v, "accum")  # accumulator of the square gradients
accum_update = optimizer.get_slot(v, "accum_update")  # accumulator of the square updates

sess = tf.Session()
sess.run(tf.global_variables_initializer())  # initialize_all_variables is deprecated

for i in range(100):
    sess.run(train_op)
    print "%.3f \t %.3f \t %.6f" % tuple(sess.run([v, accum, accum_update]))

The first 10 lines:

  v       accum     accum_update
9.994    20.000      0.000001
9.988    38.975      0.000002
9.983    56.979      0.000003
9.978    74.061      0.000004
9.973    90.270      0.000005
9.968    105.648     0.000006
9.963    120.237     0.000006
9.958    134.077     0.000007
9.953    147.205     0.000008
9.948    159.658     0.000009
Answered Oct 13 '22 by Olivier Moindrot