
Half-precision floating-point in Java

Is there a Java library anywhere that can perform computations on IEEE 754 half-precision numbers or convert them to and from double-precision?

Either of these approaches would be suitable:

  • Keep the numbers in half-precision format and compute using integer arithmetic & bit-twiddling (as MicroFloat does for single- and double-precision)
  • Perform all computations in single or double precision, converting to/from half precision for transmission (in which case what I need is well-tested conversion functions)

Edit: the conversion needs to be 100% accurate; there are lots of NaNs, infinities and subnormals in the input files.


Related question but for JavaScript: Decompressing Half Precision Floats in Javascript

asked May 28 '11 by finnw




2 Answers

You can use Float.intBitsToFloat() and Float.floatToIntBits() to convert raw bit patterns to and from primitive float values. If you can live with truncated precision (as opposed to rounding), the conversion can be implemented with just a few bit shifts.
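A minimal sketch of that bit-shift idea (a hypothetical helper, not from the original answer; it handles normalized values only, with no rounding and no care for NaN, Inf, subnormals or exponent overflow):

// Hypothetical sketch: float -> half by truncation, normalized values only.
// Assumes the rebiased exponent stays in 1..30; everything else is undefined.
public static int floatToHalfTruncated( float f )
{
    int fbits = Float.floatToIntBits( f );
    int sign = fbits >>> 16 & 0x8000;                         // sign into bit 15
    int exp  = ( ( fbits >>> 23 & 0xff ) - 127 + 15 ) << 10;  // rebias 8-bit -> 5-bit exponent
    int mant = fbits >>> 13 & 0x3ff;                          // keep top 10 mantissa bits, drop the rest
    return sign | exp | mant;
}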

I have now put a little more effort into it, and it turned out not to be quite as simple as I expected at the beginning. This version is now tested and verified in every aspect I could imagine, and I'm very confident that it produces exact results for all possible input values. It supports exact rounding and subnormal conversion in either direction.

// ignores the higher 16 bits
public static float toFloat( int hbits )
{
    int mant = hbits & 0x03ff;            // 10 bits mantissa
    int exp =  hbits & 0x7c00;            // 5 bits exponent
    if( exp == 0x7c00 )                   // NaN/Inf
        exp = 0x3fc00;                    // -> NaN/Inf
    else if( exp != 0 )                   // normalized value
    {
        exp += 0x1c000;                   // exp - 15 + 127
        if( mant == 0 && exp > 0x1c400 )  // smooth transition
            return Float.intBitsToFloat( ( hbits & 0x8000 ) << 16
                                            | exp << 13 | 0x3ff );
    }
    else if( mant != 0 )                  // && exp==0 -> subnormal
    {
        exp = 0x1c400;                    // make it normal
        do {
            mant <<= 1;                   // mantissa * 2
            exp -= 0x400;                 // decrease exp by 1
        } while( ( mant & 0x400 ) == 0 ); // while not normal
        mant &= 0x3ff;                    // discard subnormal bit
    }                                     // else +/-0 -> +/-0
    return Float.intBitsToFloat(          // combine all parts
        ( hbits & 0x8000 ) << 16          // sign  << ( 31 - 15 )
        | ( exp | mant ) << 13 );         // value << ( 23 - 10 )
}

// returns all higher 16 bits as 0 for all results
public static int fromFloat( float fval )
{
    int fbits = Float.floatToIntBits( fval );
    int sign = fbits >>> 16 & 0x8000;          // sign only
    int val = ( fbits & 0x7fffffff ) + 0x1000; // rounded value

    if( val >= 0x47800000 )               // might be or become NaN/Inf
    {                                     // avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
        {                                 // is or must become NaN/Inf
            if( val < 0x7f800000 )        // was value but too large
                return sign | 0x7c00;     // make it +/-Inf
            return sign | 0x7c00 |        // remains +/-Inf or NaN
                ( fbits & 0x007fffff ) >>> 13; // keep NaN (and Inf) bits
        }
        return sign | 0x7bff;             // unrounded not quite Inf
    }
    if( val >= 0x38800000 )               // remains normalized value
        return sign | val - 0x38000000 >>> 13; // exp - 127 + 15
    if( val < 0x33000000 )                // too small for subnormal
        return sign;                      // becomes +/-0
    val = ( fbits & 0x7fffffff ) >>> 23;  // tmp exp for subnormal calc
    return sign | ( ( fbits & 0x7fffff | 0x800000 ) // add subnormal bit
        + ( 0x800000 >>> val - 102 )      // round depending on cut off
        >>> 126 - val );  // div by 2^(1-(exp-127+15)) and >> 13 | exp=0
}
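Given the round-trip claim above, an exhaustive self-check is cheap because there are only 0x10000 half-float bit patterns (a test sketch, not part of the original answer):

// Test sketch: every 16-bit pattern should survive toFloat()/fromFloat(),
// including NaN payloads, which both functions carry through the mantissa bits.
public static void main( String[] args )
{
    for( int h = 0; h < 0x10000; h++ )
        if( fromFloat( toFloat( h ) ) != h )
            System.out.printf( "round trip failed at 0x%04x%n", h );
}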

I implemented two small extensions compared to the book because the general precision of 16-bit floats is rather low. This can make the inherent anomalies of floating-point formats visually perceptible, whereas with larger floating-point types they usually go unnoticed thanks to the ample precision.

The first is these two lines in the toFloat() function:

if( mant == 0 && exp > 0x1c400 )  // smooth transition
    return Float.intBitsToFloat( ( hbits & 0x8000 ) << 16 | exp << 13 | 0x3ff );

Floating-point numbers in the normal range of the type adapt the exponent, and thus the precision, to the magnitude of the value. But this adaptation is not smooth; it happens in steps: switching to the next higher exponent halves the precision, which then stays the same for all values of the mantissa until the next jump to the next higher exponent.

The extension code above makes these transitions smoother by returning a value that lies in the geometric center of the covered 32-bit float range for this particular half-float value. Every normal half-float value maps to exactly 8192 32-bit float values, and the returned value is supposed to sit exactly in the middle of them. But at a transition of the half-float exponent, the lower 4096 values have twice the precision of the upper 4096 values and thus cover a number space only half as large as on the other side. All 8192 of these 32-bit float values map to the same half-float value, so converting a half float to 32 bit and back yields the same half-float value regardless of which of the 8192 intermediate 32-bit values was chosen.

The extension turns the sharp step by a factor of two into something like a smoother half step by a factor of sqrt(2) at the transition, as shown in the right picture below, while the left picture visualizes the sharp step without anti-aliasing. You can safely remove these two lines from the code to get the standard behavior.

[Figure: covered number space on either side of the returned value (vertical axis roughly 3.0E-8 to 6.0E-8); left: the sharp step by a factor of two, right: the smoothed step by a factor of sqrt(2)]
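As a concrete illustration of that midpoint placement (a small demo derived from the code above, not part of the original answer): decoding the half value 1.0 returns a float slightly above 1.0, yet it still encodes back to the same half pattern:

float mid = toFloat( 0x3c00 );                     // bits 0x3f8003ff, about 1.000122
System.out.println( fromFloat( mid ) == 0x3c00 );  // true: the round trip is still exact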

The second extension is in the fromFloat() function:

    {                                     // avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
            ...
        return sign | 0x7bff;             // unrounded not quite Inf
    }

This extension slightly extends the number range of the half-float format by saving some 32-bit values from getting promoted to Infinity. The affected values are those that would have been smaller than Infinity without rounding and would become Infinity only due to the rounding. You can safely remove the lines shown above if you don't want this extension.
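For illustration (a small demo, not from the original answer; the values follow from the constants in the code): 65520.0f lies exactly halfway between the largest finite half value 65504 and the out-of-range 65536, so plain rounding would promote it to infinity, while the extension clamps it to the largest finite value:

int h = fromFloat( 65520.0f );
System.out.println( Integer.toHexString( h ) );  // "7bff" (65504), not "7c00" (+Inf)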

I tried to optimize the path for normal values in the fromFloat() function as much as possible, which made it a bit less readable due to the use of precomputed and unshifted constants. I didn't put as much effort into toFloat() since it would not exceed the performance of a lookup table anyway. So if speed really matters, you could use the toFloat() function only to fill a static lookup table with 0x10000 elements and then use that table for the actual conversion. This is about 3 times faster with a current x64 server VM and about 5 times faster with the x86 client VM.
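A sketch of that lookup-table variant (hypothetical names, not part of the original answer):

// Sketch: precompute all 0x10000 decoded values once, then convert by indexing.
private static final float[] HALF_TO_FLOAT = new float[ 0x10000 ];
static {
    for( int h = 0; h < 0x10000; h++ )
        HALF_TO_FLOAT[ h ] = toFloat( h );
}

public static float toFloatFast( int hbits )  // hypothetical name
{
    return HALF_TO_FLOAT[ hbits & 0xffff ];
}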

I hereby put the code into the public domain.

answered Oct 02 '22 by x4u


The code by x4u encodes the value 1 correctly as 0x3c00 (ref: https://en.wikipedia.org/wiki/Half-precision_floating-point_format), but the decoder with the smoothness improvement decodes that into 1.000122. The Wikipedia entry says that integer values 0..2048 can be represented exactly. Not nice...

Removing the "| 0x3ff" from the toFloat code ensures that toFloat(fromFloat(k)) == k for integer k in the range -2048..2048, probably at the cost of a bit less smoothness.
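Concretely, the change amounts to rewriting the smooth-transition return like this (a sketch of the modification; since without the 0x3ff bias this early return produces the same bits as the normal path, the two lines can just as well be deleted outright):

if( mant == 0 && exp > 0x1c400 )  // smooth transition, without the midpoint bias
    return Float.intBitsToFloat( ( hbits & 0x8000 ) << 16 | exp << 13 );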

answered Oct 02 '22 by buttonius