I'd like to know the possible ways to implement batch normalization layers that synchronize batch statistics when training with multiple GPUs.
Caffe: Maybe there are some variants of Caffe that could do it, like link. But for the BN layer, my understanding is that it still synchronizes only the outputs of layers, not the means and variances. Maybe MPI could synchronize the means and variances, but I think MPI is a little difficult to implement.
Torch: I've seen some comments here and here, which show that running_mean and running_var can be synchronized, but I think the per-batch mean and variance cannot be synchronized, or at least are difficult to.
TensorFlow: Normally, it is the same as Caffe and Torch. The implementation of BN refers to this. I know TensorFlow can distribute an operation to any device specified by tf.device(). But the computation of the means and variances happens in the middle of the BN layer, so if I gather the means and variances on the CPU, my code will look like this:
import tensorflow as tf

# cifar_input, _conv, _stride_arr, get_bn_variables and _relu are helpers
# from the rest of my model code; num_gpu, train_data_path and batch_size
# are defined elsewhere.
block1_gather = []  # per-GPU activations, to be gathered on the CPU
label_batches = []
for i in range(num_gpu):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('block1', reuse=i > 0):
            image_batch, label_batch = cifar_input.build_input(
                'cifar10', train_data_path, batch_size, 'train')
            label_batches.append(label_batch)
            x = _conv('weights', image_batch, 3, 3, 16, _stride_arr(1))
            block1_gather.append(x)

# Gather the per-GPU activations and compute shared batch statistics on the CPU.
with tf.device('/cpu:0'):
    print(block1_gather[0].get_shape())
    x1 = tf.concat(block1_gather, 0)
    # print(x1.get_shape())
    mean, variance = tf.nn.moments(x1, [0, 1, 2], name='moments')

# Normalize each GPU's slice with the shared statistics.
for i in range(num_gpu):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('block2', reuse=i > 0):
            shape = block1_gather[i].get_shape().as_list()
            assert len(shape) in [2, 4]
            n_out = shape[-1]
            beta, gamma, moving_mean, moving_var = get_bn_variables(n_out, True, True)
            x = tf.nn.batch_normalization(
                block1_gather[i], mean, variance, beta, gamma, 0.00001)
            x = _relu(x)
That is just for one BN layer. To gather the statistics on the CPU, I have to break the code apart like this. If I have more than 100 BN layers, that will be cumbersome.
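For example, I could factor the gather-and-normalize step into a helper like the sketch below (the name sync_batch_norm is just for illustration; get_bn_variables is the same helper as above), but I would still have to route every tower's tensors through it at every BN layer:

def sync_batch_norm(tower_tensors, scope, eps=1e-5):
    # tower_tensors: one NHWC activation tensor per GPU.
    # Compute shared batch statistics on the CPU, then normalize each
    # tower's tensor on its own GPU with those shared statistics.
    with tf.device('/cpu:0'):
        joint = tf.concat(tower_tensors, 0)
        mean, variance = tf.nn.moments(joint, [0, 1, 2], name=scope + '/moments')
    outputs = []
    for i, x in enumerate(tower_tensors):
        with tf.device('/gpu:%d' % i):
            with tf.variable_scope(scope, reuse=i > 0):
                n_out = x.get_shape().as_list()[-1]
                beta, gamma, moving_mean, moving_var = get_bn_variables(n_out, True, True)
                outputs.append(tf.nn.batch_normalization(
                    x, mean, variance, beta, gamma, eps))
    return outputs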
I am not an expert in these libraries, so there may be some misunderstandings; feel free to point out my errors.
I do not care much about training speed. I am doing image segmentation, which consumes a lot of GPU memory, and BN needs a reasonably large batch size (e.g. larger than 16) for stable statistics, so using multiple GPUs is inevitable. In my opinion, TensorFlow might be the best choice, but I can't resolve the code-breaking problem. Solutions with other libraries are welcome too.
To solve this issue we need to abandon the single-thread, multiple-GPU programming model. Let's assign each GPU to its own thread; by doing this, we move toward a multi-thread, multi-GPU programming model. Besides, I wrote a wrapper for a chunk of work to reduce extra code and gather the per-GPU data within one object.
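A minimal, framework-agnostic sketch of that idea (the ChunkGather class and gpu_worker function are made-up names, and the per-GPU work is only a placeholder):

import threading

class ChunkGather(object):
    # Small wrapper that collects one result per GPU in a single object.
    def __init__(self, num_gpu):
        self._lock = threading.Lock()
        self.results = [None] * num_gpu

    def put(self, gpu_id, value):
        with self._lock:
            self.results[gpu_id] = value

def gpu_worker(gpu_id, gather):
    # In real code this would pin work to GPU `gpu_id` (e.g. via
    # tf.device('/gpu:%d' % gpu_id)) and compute that GPU's statistics.
    local_stats = {'gpu': gpu_id, 'mean': 0.0, 'var': 1.0}  # placeholder
    gather.put(gpu_id, local_stats)

num_gpu = 4
gather = ChunkGather(num_gpu)
threads = [threading.Thread(target=gpu_worker, args=(i, gather)) for i in range(num_gpu)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(gather.results)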
Model parallelism: we would distribute, say, 5 layers to GPU 1, 5 layers to GPU 2, and so on up to the last GPU. Keras data parallelism: we use this technique when we have insufficient memory to load all the data points; the dataset is too big to fit in memory, so we divide it into several batches and train different batches on different GPUs.
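As a rough illustration of the model-parallel idea (the layer split and sizes here are arbitrary assumptions), different parts of a Keras model can be pinned to different GPUs with tf.device:

import tensorflow as tf

# First half of the layers on GPU 0, second half on GPU 1.
inputs = tf.keras.Input(shape=(32, 32, 3))
with tf.device('/gpu:0'):
    x = tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu')(inputs)
    x = tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu')(x)
with tf.device('/gpu:1'):
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(10)(x)
model = tf.keras.Model(inputs, outputs)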
There is another approach that I haven't mentioned yet. The Cooperative Groups (CG) programming model describes synchronization patterns both within and across CUDA thread blocks. With CG it is possible to launch a single kernel and synchronize all threads of the GPU between different stages; in other words, CG extends __syncthreads to the scope of the whole GPU.
Although a CUDA event can be recorded only when its GPU is current, it can be queried and synchronized with when a different GPU is current. This feature is widely used for inter-GPU synchronization.
A specialized Keras layer, SyncBatchNormalization, is available since TF 2.2: https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/SyncBatchNormalization
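Under a tf.distribute strategy this layer aggregates batch statistics across all replicas. A minimal sketch (the model architecture and hyperparameters here are just placeholders):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, padding='same', input_shape=(32, 32, 3)),
        tf.keras.layers.experimental.SyncBatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer='sgd',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))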
I'm not sure if I fully understand your question, but provided you set up your variable scopes properly, the tf.GraphKeys.UPDATE_OPS collection should automatically contain the update ops for batch_norm for each of your towers. If all of the update_ops are applied synchronously, they will be implicitly averaged by the parameter server; all you have to do is make sure the updates are applied before you average and apply the gradients (if I understand your intentions correctly).
Because of variable scope, each set of update ops will update the same variables, so to synchronize the update ops all you need to do is gate your gradient calculation on the complete set of update ops. You should also encapsulate all of your batch norm layers in a single name_scope to avoid grabbing any extraneous ops into UPDATE_OPS. Code skeleton below:
update_ops = []
for i, device in enumerate(devices):
    with tf.variable_scope('foo', reuse=bool(i > 0)):
        with tf.name_scope('tower_%d' % i) as name_scope:
            with tf.device(device):
                # Put as many batch_norm layers as you want here
                update_ops.extend(tf.get_collection(tf.GraphKeys.UPDATE_OPS,
                                                    name_scope))
                # make gradient calculation ops here

with tf.device(averaging_device):
    with tf.control_dependencies(update_ops):
        pass  # average and apply gradients here
If you want to try this on some existing code, try just deleting the if i == 0 line here: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/cifar10_main.py#L115
You're going to see some slowdown (we usually use only one tower to compute the batch norm statistics, for this reason), but it should do what you want.