
Mixed precision not enabled with TF1.4 on Tesla V100

I was interested in testing my neural net (an autoencoder that serves as a generator, plus a CNN as a discriminator) that uses 3D conv/deconv layers with the new Volta architecture, to benefit from mixed-precision training. I compiled the most recent source code of TensorFlow 1.4 with CUDA 9 and cuDNN 7.0, and cast all the trainable variables used by my conv/deconv layers to tf.float16. Also, all my input and output tensors have sizes that are multiples of 8.
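To illustrate the setup, here is a simplified sketch of the casting I mean (illustrative names and shapes, not my exact model code): the trainable weights and the activations are cast to tf.float16 right before the 3D conv, so the conv op itself runs in half precision.

```python
import tensorflow as tf  # TF 1.x API

def conv3d_fp16(x, out_ch, name):
    """3D conv whose math runs in float16 (illustrative sketch)."""
    in_ch = x.get_shape().as_list()[-1]
    w = tf.get_variable(name + "/w", [3, 3, 3, in_ch, out_ch],
                        dtype=tf.float32,
                        initializer=tf.truncated_normal_initializer(stddev=0.02))
    # Cast the weights and the activations to half precision so that
    # tf.nn.conv3d executes in float16.
    y = tf.nn.conv3d(tf.cast(x, tf.float16), tf.cast(w, tf.float16),
                     strides=[1, 1, 1, 1, 1], padding="SAME")
    return y
```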

Unfortunately, I do not see any substantial speed improvement with this configuration; the training time is roughly the same as with tf.float32. My understanding is that with the Volta architecture and cuDNN 7.0, mixed precision should be automatically detected by TF, enabling the use of Tensor Core math. Am I wrong, or is there anything I should do to enable it? I also tried the TF 1.5 nightly build, and it seems to be even slower than my custom 1.4.

I would appreciate it if any dev involved in TensorFlow could answer this.

EDIT: After talking with NVIDIA tech support, it seems that, while TF supports float16, it integrates mixed-precision acceleration for simple 2D conv ops, but not for 3D conv ops as of now.

Julien Jorda asked Nov 07 '17


1 Answer

Based on NVIDIA documentation, I ran a benchmark with FP16 (Tensor Cores). For that I modified the alexnet_benchmark shipped with TensorFlow: https://gist.github.com/melgor/946b9643aa25dd3839a86804fc580741
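The core of the modification is just parameterizing the dtype of the benchmark's variables and inputs. Here is a rough sketch of the pattern (see the gist for the actual code; names below are illustrative):

```python
import tensorflow as tf  # TF 1.x API

DATA_TYPE = tf.float16  # set to tf.float32 for the baseline run

def conv_layer(x, kernel_shape, name):
    """One AlexNet-style conv layer built entirely in DATA_TYPE."""
    with tf.variable_scope(name):
        kernel = tf.get_variable(
            "weights", kernel_shape, dtype=DATA_TYPE,
            initializer=tf.truncated_normal_initializer(stddev=1e-1,
                                                        dtype=DATA_TYPE))
        biases = tf.get_variable("biases", [kernel_shape[-1]],
                                 dtype=DATA_TYPE,
                                 initializer=tf.constant_initializer(0.0))
        conv = tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1], padding="SAME")
        return tf.nn.relu(tf.nn.bias_add(conv, biases))

# The benchmark feeds synthetic data, generated in the same dtype:
images = tf.random_normal([512, 224, 224, 3], dtype=DATA_TYPE)
```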

Overall, AlexNet is only ~35% faster, which is not much; I was hoping for ~2x. Maybe ResNet would show a bigger difference. The nice thing is that I can fit the model with batch_size = 5120 (fp32 cannot); one forward-backward pass takes 0.653 s, so training ImageNet for 90 epochs takes ~4 h.
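The ~4 h figure checks out with a quick back-of-the-envelope calculation (assuming the standard ~1.28M ImageNet training images):

```python
# Rough check of the training-time estimate (1,281,167 is the usual
# ILSVRC-2012 training-set size, an assumption here).
images_per_epoch = 1281167
batch_size = 5120
sec_per_step = 0.653
epochs = 90

steps = images_per_epoch / batch_size * epochs   # ~22,500 steps
hours = steps * sec_per_step / 3600.0            # ~4.1 hours
print("%.1f h" % hours)
```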

batch_size=512
alexnet_fp32: Forward-backward across 100 steps, 0.099 +/- 0.000 sec / batch
alexnet_fp16: Forward-backward across 100 steps, 0.064 +/- 0.000 sec / batch

Edit:

I managed to run ResNet models in FP16 (but without BatchNorm, since for some reason BN does not work with fp16; a workaround is sketched at the end of this answer):

batch_size=256
resnet50_fp32: Forward-backward across 100 steps, 0.575 +/- 0.001 sec / batch
resnet50_fp16: Forward-backward across 100 steps, 0.504 +/- 0.001 sec / batch

batch_size=128
resnet152_fp32: Forward-backward across 100 steps, 0.757 +/- 0.001 sec / batch
resnet152_fp16: Forward-backward across 100 steps, 0.581 +/- 0.010 sec / batch

The gain with ResNet is even smaller. It looks like FP16 does not bring much speedup on the V100, and I am not sure why. Maybe Tensor Core support is not yet fully integrated.
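As for the BatchNorm issue: a workaround I have seen suggested (an assumption on my side, not something this benchmark does) is to keep BN in float32 and cast around it:

```python
import tensorflow as tf  # TF 1.x API

def batch_norm_fp32(x, training, name="bn"):
    """Run batch norm in float32 inside an otherwise fp16 graph."""
    x32 = tf.cast(x, tf.float32)
    y32 = tf.layers.batch_normalization(x32, training=training, name=name)
    return tf.cast(y32, tf.float16)
```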

melgor89 answered Oct 03 '22