
What is tf.bfloat16 "truncated 16-bit floating point"?

Tags:

tensorflow

What is the difference between tf.float16 and tf.bfloat16 as listed in https://www.tensorflow.org/versions/r0.12/api_docs/python/framework/tensor_types ?

Also, what do they mean by "quantized integer"?

asked Jul 02 '17 by JMC



1 Answer

bfloat16 is a TensorFlow-specific format, different from the IEEE float16 type, hence the new name. The "b" stands for (Google) Brain.

Basically, bfloat16 is a float32 truncated to its first 16 bits: it keeps the same sign bit and the same 8 exponent bits, but only the top 7 mantissa bits. Conversion to and from float32 is therefore trivial, and because it covers essentially the same range as float32, it minimizes the risk of NaNs or of exploding/vanishing gradients when switching over from float32.
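
As an illustration, here is a minimal NumPy sketch of that truncation (the function names are made up for this answer, and TensorFlow's own conversion may round rather than simply truncate, but the bit layout is the same):

import numpy as np

def float32_to_bfloat16_bits(x):
    # Reinterpret the float32 as its raw 32-bit pattern, then keep bits 16-31:
    # 1 sign bit, 8 exponent bits, and the top 7 mantissa bits.
    bits32 = np.float32(x).view(np.uint32)
    return np.uint16(bits32 >> 16)

def bfloat16_bits_to_float32(bits16):
    # Going back is just as simple: shift left and zero-fill the low mantissa bits.
    return (np.uint32(bits16) << np.uint32(16)).view(np.float32)

x = np.float32(3.14159)
b = float32_to_bfloat16_bits(x)
print(hex(int(b)))                      # 0x4049
print(bfloat16_bits_to_float32(b))      # 3.140625 -- only about 2-3 decimal digits survive

In TensorFlow itself the conversion is just tf.cast(x, tf.bfloat16).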

From the sources:

// Compact 16-bit encoding of floating point numbers. This representation uses
// 1 bit for the sign, 8 bits for the exponent and 7 bits for the mantissa.  It
// is assumed that floats are in IEEE 754 format so the representation is just
// bits 16-31 of a single precision float.
//
// NOTE: The IEEE floating point standard defines a float16 format that
// is different than this format (it has fewer bits of exponent and more
// bits of mantissa).  We don't use that format here because conversion
// to/from 32-bit floats is more complex for that format, and the
// conversion for this format is very simple.

As for quantized integers, they are designed to replace floating-point values in trained networks to speed up processing. They are essentially a fixed-point encoding of real numbers, with an operating range chosen to match the distribution of values observed at any given point of the network.
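
Here is a minimal sketch of that idea (plain NumPy with made-up function names, using a simple min/max scheme rather than TensorFlow's exact quantized kernels):

import numpy as np

def quantize_uint8(x, range_min, range_max):
    # Map the observed real-valued range [range_min, range_max] onto 256 integer levels.
    scale = (range_max - range_min) / 255.0
    q = np.clip(np.round((x - range_min) / scale), 0, 255)
    return q.astype(np.uint8), scale

def dequantize_uint8(q, scale, range_min):
    # Recover approximate real values: accurate to within one quantization step.
    return q.astype(np.float32) * scale + range_min

x = np.array([-1.0, 0.0, 0.37, 2.5], dtype=np.float32)
q, scale = quantize_uint8(x, range_min=-1.0, range_max=2.5)
print(q)                                  # [  0  73 100 255]
print(dequantize_uint8(q, scale, -1.0))   # roughly [-1.0, 0.002, 0.373, 2.5]

The integers themselves are cheap to store and operate on, and the (range_min, range_max) pair is carried along so approximate real values can be recovered later.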

More on quantization here.

answered Nov 22 '22 by P-Gn