The standard is float32, but I'm wondering under what conditions it's OK to use float16?
I've compared running the same convnet with both data types and haven't noticed any issues. With large datasets I prefer float16 because I worry less about memory issues.
Mixed precision training offers a significant computational speedup by performing most operations in half precision, while keeping a small amount of data in single precision to preserve information in critical parts of the network.
TensorFlow, one of the most popular deep learning libraries, uses 32-bit floating-point precision by default.
Using mixed precision can improve performance by more than 3 times on modern GPUs and by 60% on TPUs. Today, most models use the float32 dtype, which takes 32 bits of memory. However, there are two lower-precision dtypes, float16 and bfloat16, each of which takes 16 bits of memory instead.
Benefits of mixed precision training: it speeds up memory-limited operations by accessing half the bytes compared to single precision, and it reduces the memory required for training, enabling larger models or larger minibatches.
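In TensorFlow/Keras this is essentially a one-liner. Below is a minimal sketch using the Keras mixed-precision API (TF 2.4+); the tiny convnet is just a placeholder for your own model:

```python
import tensorflow as tf

# Compute in float16, keep variables in float32 (loss scaling is applied
# automatically by model.fit under this policy).
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    # Keep the final softmax in float32 for numerical stability.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```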
Surprisingly, it's totally OK to use 16 bits, and not just for fun, but in production as well. For example, in this video Jeff Dean talks about 16-bit calculations at Google, around 52:00. A quote from the slides:
Neural net training very tolerant of reduced precision
Since GPU memory is the main bottleneck in ML computation, there has been a lot of research on precision reduction. For example:
The Gupta et al. paper "Deep Learning with Limited Numerical Precision" is about fixed-point (not floating-point) 16-bit training, but with stochastic rounding (a toy sketch of that rounding scheme follows this list).
Courbariaux et al., "Training Deep Neural Networks with Low Precision Multiplications", is about 10-bit activations and 12-bit parameter updates.
And this is not the limit. In Courbariaux et al., "BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1", they discuss 1-bit activations and weights (though with higher precision for the gradients), which makes the forward pass super fast.
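To make the stochastic rounding idea from the Gupta et al. paper concrete, here is a toy NumPy sketch (the function name and the step size are my own choices, not from the paper): a value is rounded up with probability equal to its fractional distance to the next grid point, so the quantization error is zero in expectation.

```python
import numpy as np

def stochastic_round(x, step=2.0 ** -8):
    # Quantize x onto a fixed-point grid with spacing `step`, rounding up
    # with probability proportional to the remainder, so that on average
    # stochastic_round(x) equals x.
    scaled = np.asarray(x, dtype=np.float64) / step
    floor = np.floor(scaled)
    prob_up = scaled - floor                     # fractional part in [0, 1)
    round_up = np.random.rand(*scaled.shape) < prob_up
    return (floor + round_up) * step

print(stochastic_round([0.10, -0.37, 1.234]))    # varies slightly from run to run
```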
Of course, I can imagine some networks may require high precision for training, but I would recommend at least trying 16 bits when training a big network and switching to 32 bits if it proves to work worse.
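If you want to run that comparison end to end, here is a rough sketch of what I mean, assuming a Keras setup; the tiny convnet and the random data stand in for your real model and dataset:

```python
import numpy as np
import tensorflow as tf

# Dummy data standing in for a real dataset.
x = np.random.rand(256, 32, 32, 3).astype("float32")
y = np.random.randint(0, 10, size=256)

def build_model(dtype):
    # Build the same network with weights and activations in the given dtype,
    # so both precisions are trained on identical data.
    tf.keras.backend.set_floatx(dtype)
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Train once in each precision; fall back to float32 only if float16 is clearly worse.
for dtype in ("float16", "float32"):
    history = build_model(dtype).fit(x, y, epochs=2, verbose=0)
    print(dtype, history.history["accuracy"][-1])
```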