I have a question similar to this one.
Because I have limited resources and I work with a deep model (VGG-16) - used to train a triplet network - I want to accumulate the gradients over 128 batches of one training example each, and only then propagate the error and update the weights.
It's not clear to me how to do this. I work with TensorFlow, but any implementation/pseudocode is welcome.
One solution to this problem is gradient accumulation. The idea is to split up the batch into smaller mini-batches which are run sequentially, while accumulating their results. The accumulated results are used to update the model parameters only at the end of the last mini-batch.
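This works because gradients are additive: the gradient of a loss summed over examples equals the sum of the per-example gradients. A quick NumPy sketch (a hypothetical linear model with a summed squared-error loss, all names made up for illustration) shows the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 4))   # 128 examples, 4 features
y = rng.normal(size=128)
w = rng.normal(size=4)          # current model weights

def grad_sse(X_batch, y_batch, w):
    """Gradient of the summed squared error 0.5 * ||X w - y||^2 w.r.t. w."""
    residual = X_batch @ w - y_batch
    return X_batch.T @ residual

# Gradient computed on the full batch at once...
full_grad = grad_sse(X, y, w)

# ...equals the gradients of 128 size-1 mini-batches accumulated one by one.
accum = np.zeros_like(w)
for i in range(len(X)):
    accum += grad_sse(X[i:i+1], y[i:i+1], w)

assert np.allclose(full_grad, accum)
```

Note that if the loss is a *mean* over the batch rather than a sum, you divide the accumulated gradient by the number of mini-batches before applying it.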
Gradient tapes: TensorFlow "records" relevant operations executed inside the context of a tf.GradientTape onto a "tape". TensorFlow then uses that tape to compute the gradients of the "recorded" computation using reverse-mode differentiation.
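A minimal sketch of the tape mechanism in TensorFlow 2.x - the gradient of y = x² at x = 3 is 2x = 6:

```python
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x * x                    # recorded on the tape
dy_dx = tape.gradient(y, x)      # reverse-mode differentiation: d(x^2)/dx = 2x
print(float(dy_dx))              # 6.0
```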
Coding the gradient accumulation part is also straightforward in PyTorch: backpropagate the loss at each batch (PyTorch adds new gradients into each parameter's .grad rather than overwriting it) and update the model parameters only after a set number of batches that you choose. We hold off on calling optimizer.step(), which applies the update, for accumulation_steps number of batches.
At its core, TensorFlow is just an optimized library for tensor operations (vectors, matrices, etc.) and the calculus operations used to perform gradient descent on arbitrary sequences of calculations.
Let's walk through the code proposed in one of the answers you linked to:
```python
## Optimizer definition - nothing different from any classical example
opt = tf.train.AdamOptimizer()

## Retrieve all trainable variables you defined in your graph
tvs = tf.trainable_variables()

## Creation of a list of variables with the same shape as the trainable ones,
## initialized with 0s
accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False)
              for tv in tvs]
zero_ops = [tv.assign(tf.zeros_like(tv)) for tv in accum_vars]

## Calls the compute_gradients function of the optimizer to obtain the list of gradients
gvs = opt.compute_gradients(rmse, tvs)

## Adds to each element of the list you initialized earlier with zeros its gradient
## (works because accum_vars and gvs are in the same order)
accum_ops = [accum_vars[i].assign_add(gv[0]) for i, gv in enumerate(gvs)]

## Define the training step (part with variable value update)
train_step = opt.apply_gradients([(accum_vars[i], gv[1]) for i, gv in enumerate(gvs)])
```
This first part basically adds new variables and ops to your graph, which will allow you to:

1. Accumulate the gradients with the ops `accum_ops` in (the list of) variables `accum_vars`
2. Update the model weights with the op `train_step`
Then, to use it when training, you have to follow these steps (still from the answer you linked):
```python
## The while loop for training
while ...:
    # Run the zero_ops to initialize the accumulators
    sess.run(zero_ops)
    # Accumulate the gradients 'n_minibatches' times in accum_vars using accum_ops
    for i in xrange(n_minibatches):
        sess.run(accum_ops, feed_dict={X: Xs[i], y: ys[i]})
    # Run the train_step ops to update the weights based on your accumulated gradients
    sess.run(train_step)
```
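For completeness, here is one way to express the same accumulate-then-apply loop in TF 2.x eager style with tf.GradientTape - a sketch with placeholder stand-ins for the model, loss, and data (model, loss_fn, Xs, ys, n_minibatches are all made up for illustration):

```python
import tensorflow as tf

# Hypothetical stand-ins for the model and data.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.build((None, 4))
loss_fn = tf.keras.losses.MeanSquaredError()
opt = tf.keras.optimizers.Adam()

n_minibatches = 8
Xs = tf.random.normal((n_minibatches, 1, 4))  # 8 mini-batches of one example
ys = tf.random.normal((n_minibatches, 1, 1))

# Zero-initialized accumulators, one per trainable variable (like accum_vars).
accum = [tf.Variable(tf.zeros_like(v), trainable=False)
         for v in model.trainable_variables]

for i in range(n_minibatches):
    with tf.GradientTape() as tape:
        loss = loss_fn(ys[i], model(Xs[i]))
    grads = tape.gradient(loss, model.trainable_variables)
    for a, g in zip(accum, grads):
        a.assign_add(g)                       # like accum_ops

# Average and apply once (like train_step), then reset (like zero_ops).
opt.apply_gradients([(a / n_minibatches, v)
                     for a, v in zip(accum, model.trainable_variables)])
for a in accum:
    a.assign(tf.zeros_like(a))
```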