What is the difference between tf.float16 and tf.bfloat16 as listed in https://www.tensorflow.org/versions/r0.12/api_docs/python/framework/tensor_types ?
Also, what do they mean by "quantized integer"?
The bfloat16 (Brain Floating Point) floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
There are historical reasons why there is no 2-byte float among the standard primitive types: the format is called half-precision floating point in IEEE lingo, and implementations exist, just not as a C standard primitive (which C++ inherits by extension).
Bfloat16 is a custom 16-bit floating point format for machine learning that consists of one sign bit, eight exponent bits, and seven mantissa bits. This is different from the industry-standard IEEE 16-bit floating point format, which was not designed with deep learning applications in mind.
A naive attempt at converting a float32 bit pattern to IEEE float16 by masking and shifting, such as

fltInt16 = (fltInt32 & 0x007FFFFF) >> 13;
fltInt16 |= (fltInt32 & 0x7C000000) >> 13;
fltInt16 |= (fltInt32 & 0x80000000) >> 16;

is not correct: the exponent cannot simply be shifted into place, because float32 stores an 8-bit exponent with bias 127 while IEEE float16 stores a 5-bit exponent with bias 15.
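For contrast, here is a minimal C++ sketch of what a float32-to-IEEE-float16 conversion has to do even in the easy case; the function name is illustrative, the mantissa is truncated rather than rounded to nearest, and subnormals, infinities, and NaNs are not handled:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Sketch only: convert a float32 to IEEE float16, assuming the value is a
// normal number that fits the float16 range. The exponent must be re-biased
// (127 -> 15) and the mantissa cut from 23 bits to 10.
static uint16_t float32_to_float16_normal_only(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);                   // view the float's raw bits
    uint16_t sign     = (bits >> 16) & 0x8000;             // bit 31 -> bit 15
    int32_t  exponent = ((bits >> 23) & 0xFF) - 127 + 15;  // re-bias the exponent
    uint16_t mantissa = (bits >> 13) & 0x03FF;             // keep the top 10 mantissa bits
    return sign | (uint16_t)(exponent << 10) | mantissa;   // assumes 0 < exponent < 31
}

int main() {
    printf("%04x\n", float32_to_float16_normal_only(1.5f));  // prints 3e00
    return 0;
}

Even this simplified version needs a re-bias step, which is exactly the complexity that bfloat16 avoids.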
bfloat16 is a TensorFlow-specific format that is different from IEEE's own float16, hence the new name. The b stands for (Google) Brain.

Basically, bfloat16 is a float32 truncated to its first 16 bits. So it has the same 8 bits for exponent and only 7 bits for mantissa. It is therefore easy to convert to and from float32, and because it has essentially the same range as float32, it minimizes the risk of NaNs or exploding/vanishing gradients when switching from float32.
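To make the truncation concrete, here is a minimal C++ sketch of a bfloat16 round trip; the helper names are illustrative and this is not TensorFlow's actual implementation (which may round to nearest rather than truncate):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Sketch only: a bfloat16 value is just the upper 16 bits of a float32's bit
// pattern, so conversion in both directions is a single shift.
static uint16_t float32_to_bfloat16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);        // keep sign, 8 exponent bits, top 7 mantissa bits
}

static float bfloat16_to_float32(uint16_t b) {
    uint32_t bits = (uint32_t)b << 16;    // the discarded mantissa bits come back as zeros
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

int main() {
    float x = 3.14159f;
    float y = bfloat16_to_float32(float32_to_bfloat16(x));
    printf("%f -> %f\n", x, y);           // same range as float32, only ~2-3 significant digits
    return 0;
}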
From the sources:
// Compact 16-bit encoding of floating point numbers. This representation uses
// 1 bit for the sign, 8 bits for the exponent and 7 bits for the mantissa. It
// is assumed that floats are in IEEE 754 format so the representation is just
// bits 16-31 of a single precision float.
//
// NOTE: The IEEE floating point standard defines a float16 format that
// is different than this format (it has fewer bits of exponent and more
// bits of mantissa). We don't use that format here because conversion
// to/from 32-bit floats is more complex for that format, and the
// conversion for this format is very simple.
As for quantized integers, they are designed to replace floating-point values in trained networks to speed up processing. Basically, they are a sort of fixed-point encoding of real numbers, albeit with an operating range chosen to represent the observed distribution at any given point of the net.

More on quantization can be found in the TensorFlow documentation.
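As a hypothetical illustration of the idea (not TensorFlow's actual quantization API), here is a short C++ sketch of 8-bit linear quantization, where the scale is derived from the value range observed at one point of the network:

#include <cstdint>
#include <cstdio>

// Sketch only: map real values in an observed range [min, max] linearly onto
// the integers 0..255, so cheap integer arithmetic can stand in for float math.
struct QuantParams {
    float min;  // smallest value observed at this point of the network
    float max;  // largest value observed at this point of the network
};

static uint8_t quantize(float x, QuantParams p) {
    float scale = (p.max - p.min) / 255.0f;
    int q = (int)((x - p.min) / scale + 0.5f);  // round to the nearest level
    if (q < 0) q = 0;                           // clamp values outside the range
    if (q > 255) q = 255;
    return (uint8_t)q;
}

static float dequantize(uint8_t q, QuantParams p) {
    float scale = (p.max - p.min) / 255.0f;
    return p.min + q * scale;
}

int main() {
    QuantParams p{-6.0f, 6.0f};  // e.g. the range observed for one layer's activations
    uint8_t q = quantize(1.2345f, p);
    printf("quantized: %u, dequantized: %f\n", (unsigned)q, dequantize(q, p));
    return 0;
}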