
PyTorch: training with GPU gives worse error than training the same thing with CPU

I have a next-step prediction model for time series, which is simply a GRU with a fully-connected layer on top of it. When I train it on the CPU, I get a loss of 0.10 after 50 epochs, but when I train it on the GPU, the loss is 0.15 after 50 epochs. Training for more epochs doesn't really lower the loss in either case.

Why is performance after training on CPU better than GPU?

I have tried changing the random seeds for both data and model, and these results are independent of the random seeds.

I have:

Python 3.6.2

PyTorch 0.3.0

CUDNN_MAJOR 7

CUDNN_MINOR 0

CUDNN_PATCHLEVEL 5

Edit:

I also use PyTorch's weight normalization (torch.nn.utils.weight_norm) on the GRU and on the fully-connected layer.
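For readers who want to reproduce the setup, here is a minimal sketch of such a model. The class name, layer sizes, and the choice of which GRU weights receive weight normalization are assumptions for illustration; the question does not specify them. It uses torch.nn.utils.weight_norm as named above, which newer PyTorch releases mark as deprecated in favor of torch.nn.utils.parametrizations.weight_norm.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm


class NextStepGRU(nn.Module):
    """GRU followed by a fully-connected layer, both weight-normalized."""

    def __init__(self, n_features=1, hidden_size=32):
        super().__init__()
        gru = nn.GRU(n_features, hidden_size, batch_first=True)
        # Assumption: weight norm is applied to both weight matrices of layer 0.
        gru = weight_norm(gru, name="weight_ih_l0")
        gru = weight_norm(gru, name="weight_hh_l0")
        self.gru = gru
        self.fc = weight_norm(nn.Linear(hidden_size, n_features))

    def forward(self, x):
        # x has shape (batch, time, n_features); one prediction per input step.
        out, _ = self.gru(x)
        return self.fc(out)


model = NextStepGRU()
series = torch.randn(8, 20, 1)        # synthetic data: 8 sequences of 20 steps
prediction = model(series[:, :-1])    # inputs are steps 0..18
loss = nn.functional.mse_loss(prediction, series[:, 1:])  # targets are steps 1..19
```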

asked Jan 25 '18 at 15:01 by patapouf_ai




1 Answer

After trying many things, I think I found the problem. Apparently the cuDNN libraries are sub-optimal in PyTorch. I don't know whether it is a bug in PyTorch or a bug in cuDNN, but setting

torch.backends.cudnn.enabled = False

solves the problem. With the above line, training with GPU or CPU gives the same loss at the same epoch.
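One way to check this claim on your own machine is to train the same seeded model on each device with cuDNN switched off and compare the final losses. The sketch below is an assumption-laden toy: the function name final_loss, the synthetic data, the layer sizes, and the optimizer settings are all made up for illustration, and weight normalization is left out to keep it short. Small floating-point differences between devices are expected.

```python
import torch
import torch.nn as nn

# The switch from the answer: use PyTorch's native GPU kernels instead of cuDNN.
torch.backends.cudnn.enabled = False


def final_loss(device, epochs=5):
    """Train a tiny GRU on synthetic data and return the last training loss."""
    torch.manual_seed(0)  # same initial weights and data for every device
    gru = nn.GRU(1, 16, batch_first=True)
    fc = nn.Linear(16, 1)
    series = torch.randn(32, 20, 1)
    gru, fc, series = gru.to(device), fc.to(device), series.to(device)
    params = list(gru.parameters()) + list(fc.parameters())
    optimizer = torch.optim.SGD(params, lr=0.1)
    for _ in range(epochs):
        optimizer.zero_grad()
        out, _ = gru(series[:, :-1])                            # inputs: steps 0..18
        loss = nn.functional.mse_loss(fc(out), series[:, 1:])  # targets: steps 1..19
        loss.backward()
        optimizer.step()
    return loss.item()


# Only include the GPU when one is actually available.
devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
losses = {d: final_loss(d) for d in devices}
print(losses)
```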

Edit:

It seems that the interaction between weight normalization and cuDNN is what goes wrong. If I remove weight normalization, it works. If I disable cuDNN, it works. Only the combination of the two fails in PyTorch.
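The four combinations described here can be compared directly. The following sketch is only an illustration under assumptions: the helper names (build_gru, max_abs_diff, ablation), the sizes, and the choice to normalize only the hidden-to-hidden weights are hypothetical. It compares forward outputs between the CPU and another device for identical weights, which is a cheaper proxy than the full training runs the answer is based on. Without a GPU it falls back to comparing the CPU with itself, so every difference is zero.

```python
import itertools
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm


def build_gru(use_weight_norm):
    torch.manual_seed(0)  # identical initial weights on every call
    gru = nn.GRU(1, 16, batch_first=True)
    if use_weight_norm:
        # Assumption: normalize the hidden-to-hidden weights of layer 0.
        gru = weight_norm(gru, name="weight_hh_l0")
    return gru


def max_abs_diff(use_weight_norm, device):
    """Largest gap between CPU and `device` outputs for the same weights and input."""
    x = torch.randn(4, 10, 1, generator=torch.Generator().manual_seed(1))
    with torch.no_grad():
        ref, _ = build_gru(use_weight_norm)(x)
        out, _ = build_gru(use_weight_norm).to(device)(x.to(device))
    return (ref - out.cpu()).abs().max().item()


def ablation(device):
    """Try every combination of weight norm on/off and cuDNN on/off."""
    results = {}
    for use_wn, cudnn_on in itertools.product([False, True], repeat=2):
        torch.backends.cudnn.enabled = cudnn_on
        results[(use_wn, cudnn_on)] = max_abs_diff(use_wn, device)
    torch.backends.cudnn.enabled = True  # restore PyTorch's default
    return results


device = "cuda" if torch.cuda.is_available() else "cpu"
for (use_wn, cudnn_on), diff in ablation(device).items():
    print(f"weight_norm={use_wn} cudnn={cudnn_on} max|cpu-{device}|={diff:.2e}")
```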

answered Oct 20 '22 at 01:10 by patapouf_ai