Multi GPU architecture, gradient averaging - less accurate model?

Tags: neural-network, tensorflow

When I run the CIFAR-10 model as described at https://www.tensorflow.org/tutorials/deep_cnn, I reach 86% accuracy after approximately 4 hours on a single GPU. When I use 2 GPUs, the accuracy drops to 84%, although reaching 84% is faster on 2 GPUs than on 1.

My intuition is that the average_gradients function defined at https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py returns a less accurate gradient, since an average of gradients will be less accurate than the actual gradient.

If the gradients are less accurate, then the parameters that control the function learned during training are less accurate too. Looking at the code (https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py), why is averaging the gradients over multiple GPUs less accurate than computing the gradient on a single GPU?

Is my intuition that averaging the gradients produces a less accurate value correct?

The tutorial describes the randomness in the model as follows:

The images are processed as follows:
They are cropped to 24 x 24 pixels, centrally for evaluation or randomly for training.
They are approximately whitened to make the model insensitive to dynamic range.
For training, we additionally apply a series of random distortions to artificially increase the data set size:

Randomly flip the image from left to right.
Randomly distort the image brightness.
Randomly distort the image contrast.

Source: https://www.tensorflow.org/tutorials/deep_cnn

Does this have an effect on training accuracy?

Update:

To investigate this further, I compared the loss value when training with different numbers of GPUs:

Training with 1 GPU: loss value 0.7, accuracy 86%
Training with 2 GPUs: loss value 0.5, accuracy 84%

Shouldn't the loss value be lower for the higher accuracy, not vice versa?


asked May 08 '17 10:05

blue-sky

2 Answers

In the code you linked, using the function average_gradients with 2 GPUs is exactly equivalent (1) to simply using 1 GPU with twice the batch size.

You can see it in the definition:

grad = tf.concat(axis=0, values=grads)
grad = tf.reduce_mean(grad, 0)
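As a small illustration of that equivalence, here is a minimal sketch in plain Python (not the tutorial's code). It assumes the loss is a mean over the batch and that both towers receive equally sized halves of the data. The helper mse_gradient and the toy data are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Toy 1-D linear model and data (illustrative values only).
w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.9, 6.1, 8.0]

# "1 GPU with twice the batch size": one gradient over all four examples.
full_batch_grad = mse_gradient(w, xs, ys)

# "2 GPUs": each tower sees half of the examples, then the gradients are averaged.
tower_0_grad = mse_gradient(w, xs[:2], ys[:2])
tower_1_grad = mse_gradient(w, xs[2:], ys[2:])
averaged_grad = (tower_0_grad + tower_1_grad) / 2.0

print(full_batch_grad, averaged_grad)
```

The two values agree up to floating-point rounding, so the averaging step itself does not make the gradient less accurate; what changes is the effective batch size.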

Using a larger batch size (given the same number of epochs) means fewer parameter updates per epoch, and this can change your results in either direction.

Therefore, if you want to do exactly equivalent (1) calculations in the 1-GPU and 2-GPU cases, you may want to halve the per-GPU batch size in the latter case. (People sometimes avoid doing this, because smaller batch sizes may also make the computation on each GPU slower in some cases.)

Additionally, one needs to be careful with learning rate decay here. If you use it, you want to make sure the learning rate is the same in the nth epoch in both 1-GPU and 2-GPU cases -- I'm not entirely sure this code is doing the right thing here. I tend to print the learning rate in the logs, something like

print(sess.run(lr))

should work here.
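To make the learning rate concern concrete, here is a hedged sketch of a staircase exponential decay schedule in plain Python. The constants and the helper lr_at_epoch are assumptions for illustration and have not been checked against the linked script. The point is only that if the decay interval is computed from the per-GPU batch size while each step consumes one batch per GPU, the decay happens at a different epoch.

```python
# Illustrative constants (assumed, not read from the linked script).
EXAMPLES_PER_EPOCH = 50000
BATCH_SIZE = 128          # per-GPU batch size
EPOCHS_PER_DECAY = 350.0
INITIAL_LR = 0.1
DECAY_FACTOR = 0.1


def staircase_lr(global_step, decay_steps):
    """Learning rate that drops by DECAY_FACTOR every decay_steps steps."""
    return INITIAL_LR * DECAY_FACTOR ** (global_step // decay_steps)


def lr_at_epoch(epoch, num_gpus, scale_decay_steps):
    """Learning rate after `epoch` passes over the data.

    scale_decay_steps=False computes decay_steps from the per-GPU batch
    size only; True accounts for all GPUs consuming a batch each step.
    """
    examples_per_step = BATCH_SIZE * num_gpus
    global_step = int(epoch * EXAMPLES_PER_EPOCH / examples_per_step)
    batch_for_decay = examples_per_step if scale_decay_steps else BATCH_SIZE
    decay_steps = int(EXAMPLES_PER_EPOCH / batch_for_decay * EPOCHS_PER_DECAY)
    return staircase_lr(global_step, decay_steps)


lr_one_gpu = lr_at_epoch(400, num_gpus=1, scale_decay_steps=True)
lr_two_gpu_unscaled = lr_at_epoch(400, num_gpus=2, scale_decay_steps=False)
lr_two_gpu_scaled = lr_at_epoch(400, num_gpus=2, scale_decay_steps=True)
print(lr_one_gpu, lr_two_gpu_unscaled, lr_two_gpu_scaled)
```

Under these assumed numbers, the unscaled 2-GPU schedule has not yet decayed at epoch 400 while the 1-GPU schedule has, which is the kind of mismatch that printing the learning rate would reveal.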

(1) Ignoring issues related to pseudo-random numbers, finite precision or data set sizes not divisible by the batch size.


answered Nov 14 '22 04:11

MWB

There is a decent discussion of this here (not my content). Basically, when you distribute SGD, you have to communicate gradients back and forth somehow between workers. This is inherently imperfect, so your distributed SGD typically diverges from a sequential, single-worker SGD at least to some degree. It is also typically faster, so there is a trade-off.

[Zhang et al., 2015] proposes one method for distributed SGD called elastic-averaged SGD. The paper goes through a stability analysis characterizing the behavior of the gradients under different communication constraints. It gets a little heavy, but it might shed some light on why you see this behavior.

Edit: regarding whether the loss should be lower for the higher accuracy, it is going to depend on a couple of things. First, I am assuming that you are using softmax cross-entropy for your loss (as stated in the deep_cnn tutorial you linked), and that accuracy is the total number of correct predictions divided by the total number of samples. In this case, a lower loss on the same dataset should correlate with a higher accuracy. The emphasis on "the same dataset" is important.

If you are reporting loss during training but then report accuracy on your validation (or testing) dataset, it is possible for these two to be only loosely correlated. This is because the model is fitting (minimizing loss) to a certain subset of your total samples throughout the training process, and is then tested against new samples that it has never seen before to verify that it generalizes well. The loss against this testing/validation set could be (and probably is) higher than the loss against the training set, so if the two numbers are being reported from different sets, you may not be able to draw comparisons like "loss for 1 GPU case should be lower since its accuracy is higher".
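As a side note, even on a single dataset the correlation between cross-entropy and accuracy is not a strict rule. The sketch below uses made-up predicted probabilities for the true class in a binary problem (the names model_a and model_b are hypothetical) and shows one model with lower loss but also lower accuracy.

```python
import math


def cross_entropy(true_class_probs):
    """Mean negative log-probability assigned to the correct class."""
    return -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)


def accuracy(true_class_probs):
    """Binary case: a prediction is correct when the true class gets p > 0.5."""
    return sum(p > 0.5 for p in true_class_probs) / len(true_class_probs)


# Model A: barely confident, but right on every sample.
model_a = [0.55] * 10
# Model B: very confident on nine samples, wrong on one.
model_b = [0.99] * 9 + [0.30]

loss_a, acc_a = cross_entropy(model_a), accuracy(model_a)
loss_b, acc_b = cross_entropy(model_b), accuracy(model_b)
print(loss_a, acc_a)
print(loss_b, acc_b)
```

Here model B has the lower loss and the lower accuracy, because cross-entropy rewards confidence while accuracy only counts which side of the decision threshold each prediction falls on.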

Second, if you are distributing the training then you are calculating losses across multiple workers (I believe), but only one accuracy at the end, again against a testing or validation set. Maybe the loss being reported is the best loss seen by any one worker, but overall the average losses were higher.

Basically I do not think we have enough information to decisively say why the loss and accuracy do not seem to correlate the way you expect, but there are a number of ways this could be happening, so I wouldn't dismiss it out of hand.

answered Nov 14 '22 02:11

Engineero
