How to update model parameters with accumulated gradients?

I'm new to TensorFlow and am using it to build a deep learning model.

For certain reasons, my model is limited to a small batch size, and this limited batch size gives the model high variance.

So I want to use a trick to make the effective batch size larger. My idea is to store the gradients of each mini-batch, for example 64 mini-batches, sum them, and then use the mean gradient over these 64 mini-batches of training data to update the model's parameters.

This means that the parameters are not updated for the first 63 mini-batches; after the 64th mini-batch, they are updated only once.
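To make the idea concrete, here is a minimal sketch in plain Python (not TensorFlow) on a toy one-parameter model. The data, the `mse_grad` helper, and the batch size are all made up for illustration. It only checks that, for equally sized mini-batches, the mean of the mini-batch gradients matches the gradient computed on all the data at once.

```python
# Toy model: y_hat = w * x, with mean squared error as the loss.
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

# Hypothetical data and settings, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5
batch_size = 2  # the small batch size the model is limited to

# Accumulate mini-batch gradients without updating w in between.
accum = 0.0
n_minibatches = 0
for start in range(0, len(xs), batch_size):
    end = start + batch_size
    accum += mse_grad(w, xs[start:end], ys[start:end])
    n_minibatches += 1

mean_grad = accum / n_minibatches   # gradient used for the single update
full_grad = mse_grad(w, xs, ys)     # gradient of one large batch

print(mean_grad, full_grad)
```

The two values agree up to floating-point rounding only because every mini-batch has the same size and `w` is held fixed while accumulating.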

But since TensorFlow is graph-based, does anyone know how to implement this?

Thanks very much.

Asked Feb 10 '17 10:02 by weixsong


1 Answer

I found a solution here: https://github.com/tensorflow/tensorflow/issues/3994#event-766328647

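import tensorflow as tf  # TensorFlow 1.x graph-mode API

opt = tf.train.AdamOptimizer()
tvs = tf.trainable_variables()
# One non-trainable accumulator per trainable variable, initialized to zero.
accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in tvs]
# Ops that reset the accumulators before each group of mini-batches.
zero_ops = [av.assign(tf.zeros_like(av)) for av in accum_vars]
gvs = opt.compute_gradients(rmse, tvs)  # rmse is the model's loss tensor
# Add each mini-batch gradient to its accumulator.
accum_ops = [av.assign_add(g) for av, (g, _) in zip(accum_vars, gvs)]
# Apply the accumulated (summed) gradients in one update; divide by n_minibatches for the mean.
train_step = opt.apply_gradients([(av, v) for av, (_, v) in zip(accum_vars, gvs)])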

In the training loop:

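while True:
    sess.run(zero_ops)  # reset the accumulators
    for i in range(n_minibatches):
        # Accumulate this mini-batch's gradients; no parameter update yet.
        sess.run(accum_ops, feed_dict={X: Xs[i], y: ys[i]})
    sess.run(train_step)  # one update from the accumulated gradients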

But this code doesn't seem very clean. Does anyone know how to improve it?
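One possible way to tidy the pattern is to keep the reset, accumulate, and update steps behind a single small helper. The sketch below is framework-agnostic plain Python, not TensorFlow. The `GradientAccumulator` class and its `add` method are hypothetical names, and it uses a plain SGD step rather than Adam to stay short.

```python
class GradientAccumulator:
    """Hypothetical helper: averages gradients over `steps` calls, then updates."""

    def __init__(self, params, steps, lr):
        self.params = list(params)
        self.steps = steps
        self.lr = lr
        self._sums = [0.0] * len(self.params)
        self._count = 0

    def add(self, grads):
        """Accumulate one mini-batch's gradients; return True if params were updated."""
        self._sums = [s + g for s, g in zip(self._sums, grads)]
        self._count += 1
        if self._count < self.steps:
            return False
        # Plain SGD step with the mean gradient, then reset the buffers.
        self.params = [p - self.lr * s / self.steps
                       for p, s in zip(self.params, self._sums)]
        self._sums = [0.0] * len(self.params)
        self._count = 0
        return True


# Usage: four mini-batch gradients for a single parameter, one update at the end.
acc = GradientAccumulator(params=[1.0], steps=4, lr=0.1)
updated = [acc.add([g]) for g in [1.0, 2.0, 3.0, 2.0]]
print(updated, acc.params)
```

The training loop then reduces to one `add` call per mini-batch, and the helper decides when the parameters actually change.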

Answered Oct 15 '22 04:10 by weixsong