I'd like to know the possible ways to implement batch normalization layers that synchronize batch statistics when training with multiple GPUs.
Caffe: Maybe there are some variants of Caffe that could do it, like link. But for the BN layer, my understanding is that it still synchronizes only the outputs of layers, not the means and variances. Maybe MPI could synchronize the means and variances, but I think MPI is a little difficult to implement.
Torch: I've seen some comments here and here, which show that running_mean and running_var can be synchronized, but I think the per-batch mean and variance cannot be synchronized, or at least are difficult to.
TensorFlow: Normally, it is the same as Caffe and Torch. The implementation of BN refers to this. I know TensorFlow can distribute an operation to any device specified by tf.device(). But the computation of the means and variances happens in the middle of the BN layer, so if I gather the means and variances on the CPU, my code will look like this:
import tensorflow as tf

# cifar_input, _conv, _stride_arr, get_bn_variables and _relu are helpers
# from the rest of my model code; num_gpu, train_data_path and batch_size
# are defined elsewhere.
block1_gather = []  # per-GPU activations, to be gathered on the CPU
label_batches = []
for i in range(num_gpu):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('block1', reuse=i > 0):
            image_batch, label_batch = cifar_input.build_input(
                'cifar10', train_data_path, batch_size, 'train')
            label_batches.append(label_batch)
            x = _conv('weights', image_batch, 3, 3, 16, _stride_arr(1))
            block1_gather.append(x)

# Gather the per-GPU activations and compute shared batch statistics on the CPU.
with tf.device('/cpu:0'):
    print(block1_gather[0].get_shape())
    x1 = tf.concat(block1_gather, 0)
    # print(x1.get_shape())
    mean, variance = tf.nn.moments(x1, [0, 1, 2], name='moments')

# Normalize each GPU's slice with the shared statistics.
for i in range(num_gpu):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('block2', reuse=i > 0):
            shape = block1_gather[i].get_shape().as_list()
            assert len(shape) in [2, 4]
            n_out = shape[-1]
            beta, gamma, moving_mean, moving_var = get_bn_variables(n_out, True, True)
            x = tf.nn.batch_normalization(
                block1_gather[i], mean, variance, beta, gamma, 0.00001)
            x = _relu(x)
That is just for one BN layer. To gather the statistics on the CPU, I have to break the code apart like this. If I have more than 100 BN layers, that will be cumbersome.
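For example, I could factor the gather-and-normalize step into a helper like the sketch below (the name sync_batch_norm is just for illustration; get_bn_variables is the same helper as above), but I would still have to route every tower's tensors through it at every BN layer:

def sync_batch_norm(tower_tensors, scope, eps=1e-5):
    # tower_tensors: one NHWC activation tensor per GPU.
    # Compute shared batch statistics on the CPU, then normalize each
    # tower's tensor on its own GPU with those shared statistics.
    with tf.device('/cpu:0'):
        joint = tf.concat(tower_tensors, 0)
        mean, variance = tf.nn.moments(joint, [0, 1, 2], name=scope + '/moments')
    outputs = []
    for i, x in enumerate(tower_tensors):
        with tf.device('/gpu:%d' % i):
            with tf.variable_scope(scope, reuse=i > 0):
                n_out = x.get_shape().as_list()[-1]
                beta, gamma, moving_mean, moving_var = get_bn_variables(n_out, True, True)
                outputs.append(tf.nn.batch_normalization(
                    x, mean, variance, beta, gamma, eps))
    return outputs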
I am not an expert in these libraries, so there may be some misunderstandings; feel free to point out my errors.
I do not care much about training speed. I am doing image segmentation, which consumes a lot of GPU memory, and BN needs a reasonably large batch size (e.g. larger than 16) for stable statistics, so using multiple GPUs is inevitable. In my opinion, TensorFlow might be the best choice, but I can't resolve the code-breaking problem. Solutions with other libraries are welcome too.
To solve this issue we need to abandon the single-thread, multiple-GPU programming model. Let's assign each GPU to its own thread; by doing this, we move toward a multi-thread, multi-GPU programming model. Besides, I wrote a wrapper for a chunk of work to reduce extra code and gather the per-GPU data within one object.
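A minimal, framework-agnostic sketch of that idea (the ChunkGather class and gpu_worker function are made-up names, and the per-GPU work is only a placeholder):

import threading

class ChunkGather(object):
    # Small wrapper that collects one result per GPU in a single object.
    def __init__(self, num_gpu):
        self._lock = threading.Lock()
        self.results = [None] * num_gpu

    def put(self, gpu_id, value):
        with self._lock:
            self.results[gpu_id] = value

def gpu_worker(gpu_id, gather):
    # In real code this would pin work to GPU `gpu_id` (e.g. via
    # tf.device('/gpu:%d' % gpu_id)) and compute that GPU's statistics.
    local_stats = {'gpu': gpu_id, 'mean': 0.0, 'var': 1.0}  # placeholder
    gather.put(gpu_id, local_stats)

num_gpu = 4
gather = ChunkGather(num_gpu)
threads = [threading.Thread(target=gpu_worker, args=(i, gather)) for i in range(num_gpu)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(gather.results)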
Model parallelism: we would distribute, say, 5 layers to GPU 1, 5 layers to GPU 2, and so on up to the last GPU. Keras data parallelism: we use this technique when we have insufficient memory to load all the data points; the dataset is too big to fit in memory, so we divide it into several batches and train different batches on different GPUs.
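As a rough illustration of the model-parallel idea (the layer split and sizes here are arbitrary assumptions), different parts of a Keras model can be pinned to different GPUs with tf.device:

import tensorflow as tf

# First half of the layers on GPU 0, second half on GPU 1.
inputs = tf.keras.Input(shape=(32, 32, 3))
with tf.device('/gpu:0'):
    x = tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu')(inputs)
    x = tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu')(x)
with tf.device('/gpu:1'):
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(10)(x)
model = tf.keras.Model(inputs, outputs)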
There is another approach that I haven't mentioned yet. The Cooperative Groups (CG) programming model describes synchronization patterns both within and across CUDA thread blocks. With CG it is possible to launch a single kernel and synchronize all threads of the GPU between different stages; in other words, CG extends __syncthreads to the scope of the whole GPU.
Although a CUDA event can be recorded only when its GPU is current, it can be queried and synchronized with when a different GPU is current. This feature is widely used for inter-GPU synchronization.
A specialized Keras layer, SyncBatchNormalization, is available since TF 2.2: https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/SyncBatchNormalization
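Under a tf.distribute strategy this layer aggregates batch statistics across all replicas. A minimal sketch (the model architecture and hyperparameters here are just placeholders):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, padding='same', input_shape=(32, 32, 3)),
        tf.keras.layers.experimental.SyncBatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer='sgd',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))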
I'm not sure if I fully understand your question, but provided you set up your variable scopes properly, the tf.GraphKeys.UPDATE_OPS collection should automatically contain the update ops for batch_norm for each of your towers. If all of the update_ops are applied synchronously, they will be implicitly averaged by the parameter server; all you have to do is make sure the updates are applied before you average and apply the gradients (if I understand your intentions correctly).
Because of variable scope, each set of update ops will update the same variables, so to synchronize the update ops all you need to do is gate your gradient calculation on the complete set of update ops. You should also encapsulate all of your batch norm layers in a single name_scope to avoid grabbing any extraneous ops into UPDATE_OPS. Code skeleton below:
update_ops = []
for i, device in enumerate(devices):
    with tf.variable_scope('foo', reuse=bool(i > 0)):
        with tf.name_scope('tower_%d' % i) as name_scope:
            with tf.device(device):
                # Put as many batch_norm layers as you want here
                update_ops.extend(tf.get_collection(tf.GraphKeys.UPDATE_OPS,
                                                    name_scope))
                # make gradient calculation ops here

with tf.device(averaging_device):
    with tf.control_dependencies(update_ops):
        pass  # average and apply gradients here
If you want to try this on some existing code, try just deleting the if i == 0 line here: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/cifar10_main.py#L115
You're going to see some slowdown (we usually use only one tower to compute the batch norm statistics, for this reason), but it should do what you want.