
TensorFlow average gradients over several batches

This is a possible duplicate of Tensorflow: How to get gradients per instance in a batch?. I am asking it anyway because there has not been a satisfying answer and the goal here is slightly different.

I have a very big network that fits on my GPU, but the maximum batch size I can feed is 32; anything bigger causes the GPU to run out of memory. I want to use a bigger batch in order to get a more accurate approximation of the gradient.

For concreteness, let's say I want to compute the gradient on a big batch of size 96 by feeding 3 batches of 32 in turn. The best way that I know of is to use Optimizer.compute_gradients() and Optimizer.apply_gradients(). Here is a small example of how it can work:

import tensorflow as tf
import numpy as np

learn_rate = 0.1

W_init = np.array([ [1,2,3], [4,5,6], [7,8,9] ], dtype=np.float32)
x_init = np.array([ [11,12,13], [14,15,16], [17,18,19] ], dtype=np.float32)

X = tf.placeholder(dtype=np.float32, name="x")
W = tf.Variable(W_init, dtype=np.float32, name="w")
y = tf.matmul(X, W, name="y")
loss = tf.reduce_mean(y, name="loss")

opt = tf.train.GradientDescentOptimizer(learn_rate)
grad_vars_op = opt.compute_gradients(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Compute the gradients for each batch
grads_vars1 = sess.run(grad_vars_op, feed_dict = {X: x_init[None,0]})
grads_vars2 = sess.run(grad_vars_op, feed_dict = {X: x_init[None,1]})
grads_vars3 = sess.run(grad_vars_op, feed_dict = {X: x_init[None,2]})

# Separate the gradients from the variables
grads1 = [ grad for grad, var in grads_vars1 ]
grads2 = [ grad for grad, var in grads_vars2 ]
grads3 = [ grad for grad, var in grads_vars3 ]
varl   = [ var  for grad, var in grads_vars1 ]

# Average the gradients
grads  = [ (g1 + g2 + g3)/3 for g1, g2, g3 in zip(grads1, grads2, grads3)]

sess.run(opt.apply_gradients(zip(grads,varl)))

print("Weights after 1 gradient")
print(sess.run(W))

Now this is all very ugly and inefficient, since the forward pass runs on the GPU, averaging the gradients happens on the CPU, and applying them happens on the GPU again.

Moreover, this code throws an exception because grads is a list of np.arrays, and to make it work one would have to create a tf.placeholder for every gradient.
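For reference, a minimal sketch of that placeholder workaround could look like the following (the grad_phs / apply_op names are mine, not part of the code above); it feeds the averaged numpy gradients back into the graph through one placeholder per variable:

# Hypothetical workaround: one placeholder per gradient, fed with the
# averaged numpy arrays computed above.
grad_phs = [tf.placeholder(dtype=g.dtype, shape=g.shape) for g in grads]
apply_op = opt.apply_gradients(list(zip(grad_phs, varl)))
sess.run(apply_op, feed_dict=dict(zip(grad_phs, grads)))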

I am sure there must be a better, more efficient way to do this. Any suggestions?

asked Aug 31 '17 by niko


1 Answer

You can create a copy of the trainable variables and accumulate the batch gradients in it. Here are a few simple steps to follow:

...
opt = tf.train.GradientDescentOptimizer(learn_rate)

# number of batches to accumulate gradients over
n_batches = 3
# constant to scale the summed gradients down to an average
const = tf.constant(1.0 / n_batches)
# get all trainable variables
t_vars = tf.trainable_variables()
# create a copy of all trainable variables with `0` as initial values
accum_tvars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in t_vars]
# create a op to initialize all accums vars
zero_ops = [tv.assign(tf.zeros_like(tv)) for tv in accum_tvars]

# compute gradients for a batch
batch_grads_vars = opt.compute_gradients(loss, t_vars)
# collect the (scaled by const) batch gradient into accumulated vars 
accum_ops = [accum_tvars[i].assign_add(tf.scalar_mul(const, batch_grad_var[0])) for i, batch_grad_var in enumerate(batch_grads_vars)]

# apply accums gradients 
train_step = opt.apply_gradients([(accum_tvars[i], batch_grad_var[1]) for i, batch_grad_var in enumerate(batch_grads_vars)])
# train_step = opt.apply_gradients(zip(accum_tvars, list(zip(*batch_grads_vars))[1]))

while True:
   # initialize the accumulated gradients
   sess.run(zero_ops)

   # accumulate (scaled) gradients over n_batches batches
   for i in range(n_batches):
       sess.run(accum_ops, feed_dict={X: x_init[None, i]})

   sess.run(train_step)
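
With the question's toy graph this means feeding the three rows (x_init[None, i]) through accum_ops before each train_step. Since const scales every contribution by 1/n_batches, the accumulated value is the average of the per-batch gradients, and both the accumulation and the update stay inside the graph rather than going through numpy on the CPU.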
answered Oct 22 '22 by Ishant Mrinal