Multi GPU architecture, gradient averaging - less accurate model?

When I execute the cifar10 model as described at https://www.tensorflow.org/tutorials/deep_cnn I achieve 86% accuracy after approx 4 hours using a single GPU. When I utilize 2 GPUs the accuracy drops to 84%, but reaching 84% accuracy is faster on 2 GPUs than on 1.

My intuition is that the average_gradients function as defined at https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py returns a less accurate gradient value, as an average of gradients will be less accurate than the actual gradient value.

If the gradients are less accurate, then the parameters that control the function learned during training are less accurate. Looking at the code (https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py), why is averaging the gradients over multiple GPUs less accurate than computing the gradient on a single GPU?

Is my intuition that averaging the gradients produces a less accurate value correct?

Randomness in the model is described as:

The images are processed as follows:
They are cropped to 24 x 24 pixels, centrally for evaluation or randomly for training.
They are approximately whitened to make the model insensitive to dynamic range.
For training, we additionally apply a series of random distortions to artificially increase the data set size:

Randomly flip the image from left to right.
Randomly distort the image brightness.
Randomly distort the image contrast.

src : https://www.tensorflow.org/tutorials/deep_cnn
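
For reference, the distortions described above correspond roughly to the following tf.image ops (a sketch in the TF 1.x API, not the tutorial's exact code; the crop size follows the description, the brightness/contrast parameters are illustrative):

import tensorflow as tf

# 'image' stands in for a decoded CIFAR-10 image tensor of shape [32, 32, 3].
image = tf.placeholder(tf.float32, [32, 32, 3])

distorted = tf.random_crop(image, [24, 24, 3])               # random 24 x 24 crop
distorted = tf.image.random_flip_left_right(distorted)       # random horizontal flip
distorted = tf.image.random_brightness(distorted, max_delta=63)
distorted = tf.image.random_contrast(distorted, lower=0.2, upper=1.8)
float_image = tf.image.per_image_standardization(distorted)  # "approximate whitening"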

Does this have an effect on training accuracy?

Update:

Attempting to investigate this further, here are the loss function values when training with different numbers of GPUs:

Training with 1 GPU: loss value 0.7, accuracy 86%
Training with 2 GPUs: loss value 0.5, accuracy 84%

Shouldn't the loss value be lower for higher accuracy, not vice versa?

asked May 08 '17 by blue-sky


2 Answers

In the code you linked, using the function average_gradients with 2 GPUs is exactly equivalent (1) to simply using 1 GPU with twice the batch size.

You can see it in the definition:

# stack the per-tower gradients along a leading axis ...
grad = tf.concat(axis=0, values=grads)
# ... and average them element-wise across towers
grad = tf.reduce_mean(grad, 0)

Using a larger batch size (given the same number of epochs) can have any kind of effect on your results.

Therefore, if you want to do exactly equivalent (1) calculations in 1-GPU or 2-GPU cases, you may want to halve the batch size in the latter case. (People sometimes avoid doing it, because smaller batch sizes may also make the computation on each GPU slower, in some cases)
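
A quick way to convince yourself of this equivalence is to check numerically that the mean of two half-batch gradients equals the full-batch gradient, since each per-tower loss is itself a mean over its share of the batch. A toy NumPy sketch (made-up least-squares loss, not the tutorial's model):

import numpy as np

# Gradient of the mean squared error L(w) = mean((x.w - y)^2)
def grad(x, y, w):
    return 2.0 * x.T.dot(x.dot(w) - y) / len(y)

rng = np.random.RandomState(0)
x, y, w = rng.randn(8, 3), rng.randn(8), rng.randn(3)

g_single = grad(x, y, w)                                     # one "GPU", batch of 8
g_avg = (grad(x[:4], y[:4], w) + grad(x[4:], y[4:], w)) / 2  # two "GPUs", batch of 4 each
print(np.allclose(g_single, g_avg))                          # True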

Additionally, one needs to be careful with learning rate decay here. If you use it, you want to make sure the learning rate is the same in the nth epoch in both 1-GPU and 2-GPU cases -- I'm not entirely sure this code is doing the right thing here. I tend to print the learning rate in the logs, something like

print(sess.run(lr))

should work here.

(1) Ignoring issues related to pseudo-random numbers, finite precision or data set sizes not divisible by the batch size.
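
On the learning-rate point above: if the decay schedule is defined in optimizer steps, the 2-GPU run consumes roughly twice as many examples per step and therefore reaches the nth epoch at a different step count. A minimal sketch of one way to keep the schedules aligned, assuming a tf.train.exponential_decay-style schedule and one batch per tower per step (the constant values are illustrative, not taken from the tutorial):

import tensorflow as tf

# Illustrative constants -- the real values live in the tutorial's cifar10.py.
NUM_EXAMPLES_PER_EPOCH = 50000
EPOCHS_PER_DECAY = 350.0
BATCH_SIZE = 128
NUM_GPUS = 2

# Derive decay_steps from examples seen rather than reusing the single-GPU
# step count, so both runs decay the learning rate at the same epoch.
examples_per_step = BATCH_SIZE * NUM_GPUS
decay_steps = int(NUM_EXAMPLES_PER_EPOCH / examples_per_step * EPOCHS_PER_DECAY)

global_step = tf.train.get_or_create_global_step()
lr = tf.train.exponential_decay(0.1, global_step, decay_steps, 0.1, staircase=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(lr))  # log this in both runs and compare epoch by epoch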

answered Nov 14 '22 by MWB


There is a decent discussion of this here (not my content). Basically when you distribute SGD, you have to communicate gradients back and forth somehow between workers. This is inherently imperfect, and so your distributed SGD typically diverges from a sequential, single-worker SGD at least to some degree. It is also typically faster, so there is a trade off.

[Zhang et al., 2015] proposes one method for distributed SGD called elastic-averaged SGD. The paper goes through a stability analysis characterizing the behavior of the gradients under different communication constraints. It gets a little heavy, but it might shed some light on why you see this behavior.
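
Very roughly, each worker in that scheme follows its own gradient while being pulled elastically toward a shared center variable, and the center in turn drifts toward the average of the workers. A toy sketch of that flavor of update (hypothetical quadratic loss and made-up hyperparameters, not the authors' exact algorithm):

import numpy as np

def grad(w):                     # gradient of f(w) = 0.5 * ||w - 1||^2
    return w - 1.0

eta, rho, num_workers = 0.1, 0.5, 2
workers = [np.zeros(3) for _ in range(num_workers)]  # local parameters
center = np.zeros(3)                                 # shared center variable

for _ in range(100):
    for i in range(num_workers):
        # local gradient step plus an elastic pull toward the center
        workers[i] -= eta * (grad(workers[i]) + rho * (workers[i] - center))
    # the center moves toward the average of the workers
    center += eta * rho * sum(w - center for w in workers)

print(center)  # approaches the minimizer [1, 1, 1]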

Edit: regarding whether the loss should be lower for the higher accuracy, it is going to depend on a couple of things. First, I am assuming that you are using softmax cross-entropy for your loss (as stated in the deep_cnn tutorial you linked), and assuming accuracy is the total number of correct predictions divided by the total number of samples. In this case, a lower loss on the same dataset should correlate to a higher accuracy. The emphasis on the same dataset is important.

If you are reporting loss during training but then report accuracy on your validation (or testing) dataset, it is possible for these two to be only loosely correlated. This is because the model is fitting (minimizing loss) to a certain subset of your total samples throughout the training process, and then tests against new samples that it has never seen before to verify that it generalizes well. The loss against this testing/validation set could be (and probably is) higher than the loss against the training set, so if the two numbers are being reported from different sets, you may not be able to draw comparisons like "loss for 1 GPU case should be lower since its accuracy is lower".
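
To make that first point concrete, one option (a sketch, not code from the tutorial; the placeholder names are assumptions) is to evaluate the cross-entropy loss and the accuracy together on the same validation batch, so the two numbers are directly comparable:

import tensorflow as tf

# 'logits' stands in for the model's outputs, 'labels' for integer class ids.
logits = tf.placeholder(tf.float32, [None, 10])
labels = tf.placeholder(tf.int64, [None])

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
correct = tf.equal(tf.argmax(logits, 1), labels)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

# Fetch both in a single sess.run on the validation batch so loss and accuracy
# are measured on exactly the same samples.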

Second, if you are distributing the training then you are calculating losses across multiple workers (I believe), but only one accuracy at the end, again against a testing or validation set. Maybe the loss being reported is the best loss seen by any one worker, but overall the average losses were higher.

Basically I do not think we have enough information to decisively say why the loss and accuracy do not seem to correlate the way you expect, but there are a number of ways this could be happening, so I wouldn't dismiss it out of hand.

answered Nov 14 '22 by Engineero