 

Ways to implement multi-GPU BN layers that synchronize means and vars

I'd like to know the possible ways to implement batch normalization layers that synchronize batch statistics when training with multiple GPUs.

Caffe: Maybe there are some variants of Caffe that could do it, like link. But for the BN layer, my understanding is that it still synchronizes only the outputs of layers, not the means and vars. Maybe MPI could synchronize the means and vars, but I think MPI is a little difficult to implement.

Torch: I've seen some comments here and here, which show that running_mean and running_var can be synchronized, but I think the batch mean and batch var cannot be synchronized, or at least are difficult to synchronize.

Tensorflow: Normally it is the same as Caffe and Torch. The implementation of BN refers to this. I know TensorFlow can place an operation on any device specified by tf.device(). But the computation of the means and vars sits in the middle of the BN layer, so if I gather the means and vars on the CPU, my code will look like this:

# _conv, _stride_arr, _relu, get_bn_variables and cifar_input come from the
# CIFAR-10 ResNet example code.
block1_gather = []
label_batches = []
for i in range(num_gpu):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('block1', reuse=i > 0):
            image_batch, label_batch = cifar_input.build_input(
                'cifar10', train_data_path, batch_size, 'train')
            label_batches.append(label_batch)

            x = _conv('weights', image_batch, 3, 3, 16, _stride_arr(1))
            block1_gather.append(x)

# Gather the per-GPU activations on the CPU and compute shared moments.
with tf.device('/cpu:0'):
    print(block1_gather[0].get_shape())
    x1 = tf.concat(block1_gather, 0)
    # print(x1.get_shape())
    mean, variance = tf.nn.moments(x1, [0, 1, 2], name='moments')

# Normalize each tower's activations with the shared mean and variance.
for i in range(num_gpu):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('block2', reuse=i > 0):
            shape = block1_gather[i].get_shape().as_list()
            assert len(shape) in [2, 4]
            n_out = shape[-1]
            beta, gamma, moving_mean, moving_var = get_bn_variables(n_out, True, True)

            x = tf.nn.batch_normalization(
                block1_gather[i], mean, variance, beta, gamma, 0.00001)

            x = _relu(x)

That is just one BN layer. To gather the statistics on the CPU, I have to break the model definition apart at that point. If I have more than 100 BN layers, that will be cumbersome.
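In effect, for every BN layer I would need a helper along these lines (just a rough sketch; sync_batch_norm is a name I made up, and beta/gamma are assumed to be created once and reused across towers via the variable scope):

def sync_batch_norm(tower_outputs, beta, gamma, epsilon=1e-5):
    # Gather the per-GPU activations on the CPU and compute one set of
    # moments over the combined (global) batch.
    with tf.device('/cpu:0'):
        all_x = tf.concat(tower_outputs, 0)
        mean, variance = tf.nn.moments(all_x, [0, 1, 2], name='moments')
    # Normalize each tower with the shared statistics, back on its own GPU.
    normalized = []
    for i, x in enumerate(tower_outputs):
        with tf.device('/gpu:%d' % i):
            normalized.append(tf.nn.batch_normalization(
                x, mean, variance, beta, gamma, epsilon))
    return normalized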

I am not an expert in these libraries, so there may be some misunderstandings; feel free to point out my errors.

I do not care much about training speed. I am doing image segmentation, which consumes a lot of GPU memory, and BN needs a reasonably large batch size (e.g. larger than 16) for stable statistics, so using multiple GPUs is inevitable. In my opinion, TensorFlow might be the best choice, but I can't resolve the problem of breaking up the model definition. Solutions with other libraries are welcome too.

LI Xuhong asked Mar 27 '17



2 Answers

A specialized Keras layer, SyncBatchNormalization, is available since TF 2.2: https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/SyncBatchNormalization
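A minimal sketch of how it might be used (the model here is just a placeholder for illustration): build the model under a tf.distribute.MirroredStrategy scope, and the layer reduces the batch statistics across all replicas, so BN effectively sees the global batch.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, padding='same', input_shape=(32, 32, 3)),
        # Batch statistics are synchronized across replicas, so the effective
        # BN batch size is the global batch size, not the per-GPU one.
        tf.keras.layers.experimental.SyncBatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer='sgd',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))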

Henrique Mendonça answered Sep 28 '22


I'm not sure if I fully understand your question, but provided you set up your variable scopes properly, the tf.GraphKeys.UPDATE_OPS collection should automatically contain the batch-norm update ops for each of your towers. If all of the update_ops are applied synchronously, they will be implicitly averaged by the parameter server; all you have to do is make sure the updates are applied before you average and apply the gradients (if I understand your intentions correctly).

Because of variable scoping, each set of update ops will update the same variables, so to synchronize the update ops all you need to do is gate your gradient calculation on the complete set of update ops. You should also encapsulate all of your batch norm layers in a single name_scope to avoid grabbing any extraneous ops from UPDATE_OPS. Code skeleton below:

update_ops = []
for i, device in enumerate(devices):
  with tf.variable_scope('foo', reuse=bool(i > 0)):
    with tf.name_scope('tower_%d' % i) as name_scope:
      with tf.device(device):
        # Put as many batch_norm layers as you want here.
        pass
      # Collect only the update ops created inside this tower's name scope.
      update_ops.extend(tf.get_collection(tf.GraphKeys.UPDATE_OPS,
                                          name_scope))
# make gradient calculation ops here
with tf.device(averaging_device):
  with tf.control_dependencies(update_ops):
    # Average and apply gradients here; the control dependency guarantees
    # that every tower's batch-norm updates run first.
    pass

If you want to try this on some existing code, try just deleting the if i == 0 line here: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/cifar10_main.py#L115

You're going to see some slowdown (we usually only use one tower to compute the batch norm statistics for this reason), but it should do what you want.

Eli Bixby answered Sep 28 '22