I'm using TensorFlow to build a deep learning model, and I'm new to TensorFlow.
For certain reasons, my batch size is limited, and this small batch size gives the model high variance.
So I want to use a trick to effectively make the batch size larger. My idea is to store the gradients of each mini-batch, for example 64 mini-batches, then sum the gradients together and use the mean gradient of these 64 mini-batches of training data to update the model's parameters.
This means that for the first 63 mini-batches I do not update the parameters, and after the 64th mini-batch I update the model's parameters only once.
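To illustrate the intent, here is a toy NumPy sketch of the update rule I have in mind (grad_fn and the data are just placeholders, not my actual model):

import numpy as np

params = np.zeros(10)
learning_rate = 0.01
mini_batches = [np.random.randn(8, 10) for _ in range(64)]

def grad_fn(params, batch):
    # placeholder for backprop; returns a fake gradient of the right shape
    return batch.mean(axis=0)

accumulated = np.zeros_like(params)
for batch in mini_batches:       # no parameter update during these 64 batches
    accumulated += grad_fn(params, batch)
# a single update using the mean gradient over the 64 mini-batches
params -= learning_rate * accumulated / len(mini_batches)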
But since TensorFlow is graph-based, does anyone know how to implement this?
Thanks very much.
One solution to this problem is gradient accumulation. The idea is to split up the batch into smaller mini-batches which are run sequentially, while accumulating their results. The accumulated results are used to update the model parameters only at the end of the last mini-batch.
Gradient accumulation is extremely useful when working with large images/volumetric data, using low-end hardware, or training on multiple GPUs. For me, the most important feature is to be able to use larger batch sizes without exhausting memory.
Training speed can also be improved when combining DDP (distributed data parallel training) with gradient accumulation, because optimizer.step() is called only every K steps instead of after every mini-batch.
TensorFlow provides the tf.GradientTape API for automatic differentiation; that is, computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow "records" relevant operations executed inside the context of a tf.GradientTape onto a "tape", and then uses that tape to compute the gradients.
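For TensorFlow 2.x, the same accumulation idea can be written with tf.GradientTape. Here is a minimal sketch; the toy model, loss_fn, and optimizer below are placeholder assumptions, not part of the original post:

import tensorflow as tf

# Hypothetical toy setup; swap in your own model, loss, and optimizer
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.build(input_shape=(None, 8))
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam()

accum_steps = 64  # number of mini-batches whose gradients are averaged

# One non-trainable accumulator per trainable variable, initialized to zeros
accum_grads = [tf.Variable(tf.zeros_like(v), trainable=False)
               for v in model.trainable_variables]

@tf.function
def accumulate_step(x, y):
    # Compute this mini-batch's gradients and add them to the accumulators
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    for acc, g in zip(accum_grads, grads):
        acc.assign_add(g / accum_steps)  # running mean over accum_steps batches

def apply_step():
    # Apply the averaged gradients once, then reset the accumulators
    optimizer.apply_gradients(
        [(acc.read_value(), v)
         for acc, v in zip(accum_grads, model.trainable_variables)])
    for acc in accum_grads:
        acc.assign(tf.zeros_like(acc))

A training loop would call accumulate_step on each of the 64 mini-batches and then call apply_step once.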
I found a solution here: https://github.com/tensorflow/tensorflow/issues/3994#event-766328647
opt = tf.train.AdamOptimizer()
tvs = tf.trainable_variables()
# One non-trainable accumulator per trainable variable, initialized to zeros
accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in tvs]
# Ops that reset all accumulators to zero
zero_ops = [av.assign(tf.zeros_like(av)) for av in accum_vars]
# Gradients of the loss (rmse) with respect to the trainable variables
gvs = opt.compute_gradients(rmse, tvs)
# Ops that add each mini-batch gradient to its accumulator
accum_ops = [accum_vars[i].assign_add(gv[0]) for i, gv in enumerate(gvs)]
# Op that applies the accumulated gradients to the variables
train_step = opt.apply_gradients([(accum_vars[i], gv[1]) for i, gv in enumerate(gvs)])
In the training loop:
while True:
    sess.run(zero_ops)  # reset the gradient accumulators
    for i in range(n_minibatches):
        # accumulate gradients over each mini-batch without updating the weights
        sess.run(accum_ops, feed_dict={X: Xs[i], y: ys[i]})
    sess.run(train_step)  # apply the accumulated gradients once
But this code does not look very clean, does anyone know how to tidy it up?
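One possible cleanup, just a sketch along the same lines (it reuses rmse, X, y, Xs, ys, and n_minibatches from above, averages the gradients instead of summing them, and groups the per-variable ops so the loop only runs two ops):

opt = tf.train.AdamOptimizer()
tvs = tf.trainable_variables()

# Pair each trainable variable with its gradient, dropping variables the loss
# does not depend on (their gradient is None)
gvs = [(g, v) for g, v in opt.compute_gradients(rmse, tvs) if g is not None]

# One zero-initialized, non-trainable accumulator per (gradient, variable) pair
accum_vars = [tf.Variable(tf.zeros_like(v.initialized_value()), trainable=False)
              for _, v in gvs]

# Grouped ops: reset the accumulators, add one mini-batch's contribution to the
# running mean, and apply the accumulated (averaged) gradients
zero_op = tf.group(*[av.assign(tf.zeros_like(av)) for av in accum_vars])
accum_op = tf.group(*[av.assign_add(g / n_minibatches)
                      for av, (g, _) in zip(accum_vars, gvs)])
train_step = opt.apply_gradients(
    [(av, v) for av, (_, v) in zip(accum_vars, gvs)])

The training loop stays the same, except that it runs the single zero_op and accum_op instead of the lists of ops.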