 

Very weird behaviour when running the same deep learning code on two different GPUs

Tags: gpu, pytorch

I am training networks using the PyTorch framework. I had a K40 GPU in my computer. Last week, I added a 1080 to the same machine.

In my first experiment, I observed identical results on both GPUs. Then I tried my second piece of code on both GPUs. In this case, I consistently got good results on the K40 while consistently getting awful results on the 1080, for exactly the same code.

At first, I thought the only reason for such divergent outputs would be the random seeds in the code, so I fixed the seeds like this:

import torch
import numpy

torch.manual_seed(3)
torch.cuda.manual_seed_all(3)
numpy.random.seed(3)

But this did not solve the issue. I don't believe the issue can be randomness, because I was consistently getting good results on the K40 and consistently getting bad results on the 1080. Moreover, I tried exactly the same code on 2 other computers and 4 other 1080 GPUs and always achieved good results. So the problem has to be the 1080 I recently plugged in.
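As far as I understand, a fully deterministic setup would also pin Python's own RNG and the cuDNN flags in addition to the seeds above (a rough sketch; the flag names assume a reasonably recent PyTorch build):

import random
import torch

random.seed(3)  # Python's built-in RNG, in case any library draws from it

# force cuDNN to pick deterministic kernels instead of auto-tuned ones
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False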

I suspect the problem might be the driver, or the way I installed PyTorch. But it is still weird that I only get bad results for some of the experiments; for the other experiments, the results were identical.
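If it helps with diagnosing the driver/install angle, a short snippet like this (using only standard torch attributes) reports the relevant versions for both cards:

import torch

print(torch.__version__)               # PyTorch build
print(torch.version.cuda)              # CUDA version the build was compiled against
print(torch.backends.cudnn.version())  # cuDNN version
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))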

Can anyone help me on this?

asked Sep 20 '25 by kko

2 Answers

Q: Can you please tell what type of experiment this is, and what NN architecture you use?

In the tips below, I will assume you are running a straight backpropagation neural net.

  • You say learning in your test experiment is "unstable"? Training of a NN should not be unstable. When it is, different processors can end up with different outcomes, influenced by numeric precision and rounding errors. Saturation may have occurred: check whether your weight values have become too large (see the sketch after this list). In that case, 1) check whether your training inputs and outputs are logically consistent, and 2) add more neurons to the hidden layers and train again.

  • It is a good idea to check random() calls, but take into account that in a backprop NN there are several places where random() functions can be used. Some backprop NNs also add dynamic noise to the training patterns to prevent early saturation of the weights. When this training noise is scaled wrongly, you can get bizarre results; when the noise is missing or too small, you can end up with saturation.
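For the weight-magnitude check in the first point, a rough sketch like this would show whether values are blowing up (the model here is a hypothetical stand-in; substitute your own network):

import torch.nn as nn

# hypothetical stand-in for your network
model = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 1))

# print the largest absolute weight per parameter tensor;
# values growing into the hundreds or thousands hint at saturation
for name, param in model.named_parameters():
    print(f"{name}: max |w| = {param.abs().max().item():.4f}")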

answered Sep 22 '25 by Goodies

I had the same problem. I solved it by simply changing sum to torch.sum. Please try to change all the Python built-in functions to their PyTorch (GPU-aware) equivalents.
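A minimal illustration of the difference (a sketch only; the exact numerical impact depends on your code):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(10000, device=device)

s_builtin = sum(x)      # Python loop: adds the elements one at a time
s_torch = torch.sum(x)  # single fused reduction kernel on the device

# the two results can differ slightly because the accumulation order differs
print(s_builtin.item(), s_torch.item())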

answered Sep 22 '25 by dimo