I was interested in testing my neural net (an autoencoder that serves as a generator, plus a CNN as a discriminator) that uses 3D conv/deconv layers on the new Volta architecture, to benefit from mixed-precision training. I compiled the most recent source of TensorFlow 1.4 with CUDA 9 and cuDNN 7.0, and cast all the trainable variables used by my conv/deconv layers to tf.float16. All my input and output tensors also have sizes that are multiples of 8.
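For context, the casting I mean looks roughly like this (a minimal sketch with placeholder shapes, names, and initializers, not my exact model code):

```python
import tensorflow as tf

def conv3d_fp16(x, filters, name):
    """3D convolution whose weights are created directly in tf.float16."""
    in_ch = x.get_shape().as_list()[-1]
    with tf.variable_scope(name):
        w = tf.get_variable('w', [3, 3, 3, in_ch, filters], dtype=tf.float16,
                            initializer=tf.truncated_normal_initializer(stddev=0.02))
        b = tf.get_variable('b', [filters], dtype=tf.float16,
                            initializer=tf.zeros_initializer())
    return tf.nn.conv3d(x, w, strides=[1, 1, 1, 1, 1], padding='SAME') + b

# All tensor dims are multiples of 8, as the Tensor Core docs require.
x = tf.placeholder(tf.float16, [8, 16, 16, 16, 8])  # NDHWC
y = conv3d_fp16(x, 16, 'conv1')
```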
Unfortunately, I do not see any substantial speed improvement with this configuration; training time is roughly the same as with tf.float32. My understanding is that on Volta with cuDNN 7.0, TF should detect mixed precision automatically and enable Tensor Core math. Am I wrong, or is there anything I need to do to enable it? I also tried the TF 1.5 nightly build, and it seems even slower than my custom 1.4 build.
I would appreciate it if any dev involved in TensorFlow could answer this.
EDIT: After talking with NVIDIA tech support, it seems that, while TF supports float16, it currently integrates mixed-precision acceleration only for 2D conv ops, not for 3D conv ops.
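If you want to see the difference on your own machine, something like the following rough timing sketch should show it (arbitrary shapes, all multiples of 8; not a rigorous benchmark):

```python
import time
import numpy as np
import tensorflow as tf

def time_conv(op, feed, n=50):
    """Average wall-clock time per sess.run of `op`."""
    with tf.Session() as sess:
        sess.run(op, feed)  # warm-up, lets cuDNN autotune pick a kernel
        start = time.time()
        for _ in range(n):
            sess.run(op, feed)
        return (time.time() - start) / n

# fp16 2D and 3D convolutions of comparable size.
x2 = tf.placeholder(tf.float16, [32, 64, 64, 64])
w2 = tf.constant(np.random.rand(3, 3, 64, 64).astype(np.float16))
conv2 = tf.nn.conv2d(x2, w2, strides=[1, 1, 1, 1], padding='SAME')

x3 = tf.placeholder(tf.float16, [8, 16, 16, 16, 64])
w3 = tf.constant(np.random.rand(3, 3, 3, 64, 64).astype(np.float16))
conv3 = tf.nn.conv3d(x3, w3, strides=[1, 1, 1, 1, 1], padding='SAME')

d2 = np.random.rand(32, 64, 64, 64).astype(np.float16)
d3 = np.random.rand(8, 16, 16, 16, 64).astype(np.float16)
print('conv2d fp16: %.4f s/step' % time_conv(conv2, {x2: d2}))
print('conv3d fp16: %.4f s/step' % time_conv(conv3, {x3: d3}))
```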
Based on the NVIDIA documentation, I ran a benchmark with FP16 (Tensor Cores). For that I modified the alexnet_benchmark shipped with TensorFlow:
https://gist.github.com/melgor/946b9643aa25dd3839a86804fc580741
Overall, AlexNet is only ~35% faster, which is less than I hoped; I was expecting something closer to 2x. Maybe ResNet would show a bigger difference. The nice thing is that I can fit the model with batch_size = 5120 (fp32 cannot), and one forward-backward pass takes 0.653 s, so training ImageNet for 90 epochs would take ~4 h. The gist has the full change; a sketch of the variable-handling pattern is below the numbers.
batch_size=512
alexnet_fp32: Forward-backward across 100 steps, 0.099 +/- 0.000 sec / batch
alexnet_fp16: Forward-backward across 100 steps, 0.064 +/- 0.000 sec / batch
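For anyone reproducing this: the usual pattern (recommended in NVIDIA's mixed-precision material) is to keep fp32 master weights and hand fp16 casts to the compute ops, via a custom getter. A sketch of that pattern (the getter name is just illustrative; see the gist for what I actually ran):

```python
import tensorflow as tf

def fp32_storage_getter(getter, name, shape=None, dtype=None,
                        trainable=True, *args, **kwargs):
    """Store trainable variables in fp32, return fp16 casts for compute.

    fp32 master weights avoid underflow in the small weight updates
    while the conv math itself still runs in half precision.
    """
    storage_dtype = tf.float32 if trainable else dtype
    var = getter(name, shape, dtype=storage_dtype,
                 trainable=trainable, *args, **kwargs)
    if trainable and dtype != tf.float32:
        var = tf.cast(var, dtype)
    return var

# Build the network under a scope with this getter and fp16 inputs.
with tf.variable_scope('model', custom_getter=fp32_storage_getter):
    x = tf.placeholder(tf.float16, [64, 224, 224, 8])
    y = tf.layers.conv2d(x, 64, 11, strides=4, padding='SAME')
```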
Edit:
I managed to run ResNet models in FP16 (but without BatchNorm; for some reason BN does not work with fp16, see the workaround sketch below the numbers):
batch_size=256
resnet50_fp32: Forward-backward across 100 steps, 0.575 +/- 0.001 sec / batch
resnet50_fp16: Forward-backward across 100 steps, 0.504 +/- 0.001 sec / batch
batch_size=128
resnet152_fp32: Forward-backward across 100 steps, 0.757 +/- 0.001 sec / batch
resnet152_fp16: Forward-backward across 100 steps, 0.581 +/- 0.010 sec / batch
The gain with ResNet is even smaller. It looks like FP16 does not give much of a gain on the V100, and I am not sure why. Maybe Tensor Core support is not fully integrated yet.
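About the BatchNorm problem: one workaround I have not benchmarked here, but which should sidestep it, is to run the normalization itself in fp32 and cast back, so the conv layers stay in half precision:

```python
import tensorflow as tf

def batch_norm_fp16(x, training, name='bn'):
    """BatchNorm for fp16 activations: normalize in fp32, cast back.

    Common workaround when tf.layers.batch_normalization misbehaves
    on tf.float16 inputs.
    """
    with tf.variable_scope(name):
        y = tf.layers.batch_normalization(tf.cast(x, tf.float32),
                                          training=training)
    return tf.cast(y, tf.float16)
```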