 

How to convert a float to a non standard encoding

Tags: c++, c, types, math, cocoa

I am writing a program that creates ICC color formats. These formats specify a data type called s15Fixed16Number which has a sign bit, 15 integer bits and 16 fractional bits. IEEE 754 32-bit floats have a sign bit, 8 exponent bits and 23 fractional bits.

I need to get input from a text box and convert it into an s15Fixed16Number. Some searching turned up this on Google Books, but that discusses converting a decimal number to an s15Fixed16Number. I suppose I could just use the method explained in the link, but I haven't done any testing yet to determine how accurate it would be. I could also try to convert the character input from the text box directly, but I haven't thought much about that yet.

I'm using Cocoa but I don't think that matters; any C function should work. Here are some example values in s15Fixed16Number format:

              -32768.0 = 0x80000000
                     0 = 0x00000000
                   1.0 = 0x00010000
 32767 + (65535/65536) = 0x7FFFFFFF

I guess it's been a while since that numerical computation class!

jonc asked Aug 30 '10 03:08


2 Answers

Assuming your C environment does 2's complement integers, then this is much simpler than it seems.

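#include <stdint.h>

typedef int32_t s1516;  // 32-bit 2's complement signed integer
s1516 floattos1516(double f) {
    // Round to nearest: the cast truncates toward zero, so negatives need -0.5 rather than +0.5.
    return (s1516)(f * 65536. + (f < 0 ? -0.5 : 0.5));
}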

The representation is a fixed point value, with 16 bits of fraction. That is the same as a rational number whose denominator is always 65536 (or 2^16). To form such a rational from a floating point value, you just multiply by the denominator. Then it is just a matter of an appropriate rounding, and a truncation to the integral type.

The standard picked the form they did because this just works if your system uses 2's complement integer arithmetic. Although it is true that the leftmost bit does represent the sign, it is not a sign bit in the sense that is used in a floating point representation.
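To make the "not a sign bit" point concrete, here is a minimal sketch that returns the raw bit pattern for a given value. The helper name s15f16_bits is made up for illustration, and it assumes the input is already within -32768.0 <= v < 32768.0.

```c
#include <stdint.h>

/* Hypothetical helper (not from the ICC spec): scale by 2^16, round half
   away from zero, and return the raw 32-bit pattern.
   Assumes -32768.0 <= v < 32768.0; no range checking is done here. */
static uint32_t s15f16_bits(double v) {
    double scaled = v * 65536.0;
    int32_t fixed = (int32_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
    /* Converting a signed value to uint32_t is defined modulo 2^32,
       which yields exactly the 2's complement bit pattern. */
    return (uint32_t)fixed;
}
```

With this, -1.0 comes out as 0xFFFF0000. A sign-and-magnitude scheme would instead give 0x80010000, which in this format would decode to a value close to -32767.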

If your calculations are truly float rather than double, you will find that you don't have as much precision in your calculation as is available in the fixed point value for numbers near full scale. If you calculate in double, then you will always have more precision in your calculation than in the result.
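As a rough illustration of that precision gap, the sketch below converts the same value once directly from a double and once after it has been squeezed through a float. Both function names are invented for this example, and inputs are assumed to be in range.

```c
#include <stdint.h>

/* Hypothetical helper: round-to-nearest conversion computed in double.
   Assumes -32768.0 <= v < 32768.0. */
static int32_t to_s15f16_double(double v) {
    double scaled = v * 65536.0;
    return (int32_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}

/* Same conversion, but the argument has first been rounded to a float,
   so only 24 significant bits of the input survive. */
static int32_t to_s15f16_float(float v) {
    return to_s15f16_double((double)v);
}
```

For an input of 32767 + 1/65536, the double path gives 0x7FFF0001, while the float path gives 0x7FFF0000 because the float cannot hold the extra 1/65536 at that magnitude.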

Edit:

What appears to be the latest spec is available from the ICC as Specification ICC.1:2004-10 (Profile version 4.2.0.0). Section 5.1.3:

5.1.3 s15Fixed16Number

A fixed signed 4-byte/32-bit quantity which has 16 fractional bits as shown in table 3.

Table 3 — s15Fixed16Number
  Number               Encoding
-32768,0               80000000h
     0                 00000000h
     1,0               00010000h
 32767 + (65535/65536) 7FFFFFFFh

Aside from localized preference for the representation of a decimal point, these values are completely consistent with my understanding that the representation is simply signed 2's complement integers that should be divided by 65536 to get their values.

The natural conversion to the representation is simply to multiply by 65536, and from it simply to divide. Picking a suitable rounding rule is a matter of preference.
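A small sketch of both directions, with two possible rounding rules side by side. The function names are my own, and the encoders assume the input is already in range.

```c
#include <stdint.h>

/* Encode with round-to-nearest (half away from zero).
   Assumes -32768.0 <= v < 32768.0. */
static int32_t s15f16_from_double_nearest(double v) {
    double scaled = v * 65536.0;
    return (int32_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}

/* Encode by truncating toward zero instead; same range assumption. */
static int32_t s15f16_from_double_trunc(double v) {
    return (int32_t)(v * 65536.0);
}

/* Decode: every 32-bit fixed value is exactly representable as a double. */
static double s15f16_to_double(int32_t fixed) {
    return fixed / 65536.0;
}
```

For example, 0.1 encodes as 6554 with round-to-nearest and as 6553 with truncation. Decoding and re-encoding with round-to-nearest gives back the same integer.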

The full scale range is from -32768.0 (0x80000000) to approximately 32767.9999847412 (0x7fffffff), inclusive.
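Since out-of-range input (from a text box, say) would overflow the cast, one option is to clamp to that full scale range first. This is only a sketch: the name s15f16_saturate is invented, and mapping NaN to 0 is an arbitrary choice of mine, not something the spec prescribes.

```c
#include <stdint.h>

/* Hypothetical saturating conversion: values outside the representable
   range are clamped to 0x7FFFFFFF or 0x80000000 instead of overflowing. */
static int32_t s15f16_saturate(double v) {
    double scaled = v * 65536.0;
    if (scaled != scaled) {
        return 0;              /* NaN: arbitrary choice for this sketch */
    }
    if (scaled >= 2147483647.0) {
        return INT32_MAX;      /* 0x7FFFFFFF, about 32767.9999847412 */
    }
    if (scaled <= -2147483648.0) {
        return INT32_MIN;      /* 0x80000000, exactly -32768.0 */
    }
    /* In range: round half away from zero, then truncate. */
    return (int32_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}
```

Whether clamping, rejecting the input, or reporting an error is the right behavior depends on the application.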

I would agree that it would be clearer if the specification had happened to show the representation in hex of any negative values. I skimmed the entire document, and the only values I found represented in both decimal and hex were CIE XYZ chromaticity coordinates, which by definition range from 0 to 1, and hence don't help as exemplar negative values.

RBerteig answered Oct 23 '22 04:10


Don't get carried away with the internal representation of the float. Fixed-point values are just integers with a constant scale factor. Just remember that a float has less precision (a 24-bit significand) than your target format, so converted values may be off in the lower 7 bits for large magnitudes.

#include <stdint.h>

// s15Fixed16Number is presumably typedef'ed to a 32-bit unsigned int
typedef uint32_t s15Fixed16Number;
float foo = 1.0f;
int32_t fooFixedSigned = (int32_t)(foo * 65536);
// The format is 2's complement, so the signed value's bit pattern is used as-is;
// setting a separate sign bit would produce sign-magnitude instead
s15Fixed16Number fooFixed = (s15Fixed16Number)fooFixedSigned;
// you'll also need to explicitly check for overflows and underflows and handle them however is appropriate to your situation

Alan answered Oct 23 '22 05:10