Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Behaviour of custom NaN floats in Python and Numpy

I need to pack some extra information into floating point NaN values. I am using single-precision IEEE 754 floats (32-bit floats) in Python. How do Python and NumPy treat these values?

Theory

The IEEE 754-2008 standard seems to think a number is really not a number, if the exponent bits (23..30) are set, and at least one of the significand bits is set. Thus if we transform the float into a 32-bit integer representation, anything satisfying the following conditions goes:

  • i & 0x7f800000 == 0x7f800000
  • i & 0x007fffff != 0

This would leave me with plenty of choice. However, the standard seems to say that the highest bit of the significand is is_quiet and should be set to avoid exceptions in calculations.

Practical tests

Python 2.7

In order to be sure, I ran some tests with interesting results:

import math
import struct

std_nan = struct.unpack("f4", struct.pack("I", 0x7fc00000))[0]
spec_nan = struct.unpack("f4", struct.pack("I", 0x7f800001))[0]
spec2_nan = struct.unpack("f4", struct.pack("I", 0x7fc00001))[0]

print "{:08x}".format(struct.unpack("I", struct.pack("f4", std_nan))[0])
print "{:08x}".format(struct.unpack("I", struct.pack("f4", spec_nan))[0])
print "{:08x}".format(struct.unpack("I", struct.pack("f4", spec2_nan))[0])

This gives:

7fc00000
7fc00001 <<< should be 7f800001
7fc00001

This and some further testing seems to suggst that something (struct.unpack?) always sets the is_quiet bit.

NumPy

I tried the same with NumPy, because there I can always rely on conversions not changing a single bit:

import numpy as np

intarr = np.array([0x7f800001], dtype='uint32')
f = np.fromstring(intarr.tostring(), dtype='f4')
print np.isnan(f)

This gives:

RuntimeWarning: invalid value encountered in isnan
[True]

but if the value is replaced by 0x7fc00001, there is no error.

Hypothesis

Both Python and NumPy will be happy, if I set the is_quiet and use the rest of the bits for my own purposes. Python handles the bit by itself, NumPy relies on lower-level language implementations and/or the hardware FP implementation.

Question

Is my hypothesis correct, and can it be proved or disproved by some official documentation? Or is it one of those platform-dependent things?

I found something quite related here: How to distinguish different types of NaN float in Python, but I could not find any official word on how extra-information-carrying NaNs should be handled in Python or NumPy.

like image 252
DrV Avatar asked Jul 06 '14 22:07

DrV


People also ask

Is NumPy NaN float?

NaN stands for Not A Number and is a common missing data representation. It is a special floating-point value and cannot be converted to any other type than float.

How does NumPy deal with NaN?

In Python, NumPy with the latest version where nan is a value only for floating arrays only which stands for not a number and is a numeric data type which is used to represent an undefined value. In Python, NumPy defines NaN as a constant value.

Does NumPy support NaN?

No, you can't, at least with current version of NumPy. A nan is a special value for float arrays only.


1 Answers

After thinking of this for some time and having a look at the source code ad then rethinking a bit, I think I can answer my own question. My hypotheses are almost correct but not the whole story.

As NumPy and Python handle numbers quite differently, this answer has two parts.

What really happens in Python and NumPy with NaNs

NumPy

This may be slightly platform-specific, but on most platforms NumPy uses the gcc builtin isnan, which in turn does something fast. The runtime warnings come from the deeper levels, from the hardware in most cases. (NumPy may use on of several methods of determining the NaN status, such as x != x, which works on at least AMD 64 platforms, but with gcc it is down to gcc, which probably uses some pretty short code for the purpose.)

So, in theory there is no way to guarantee how NumPy handles NaNs, but in practice on the more common platforms it will do as the standard says because that's what the hardware does. NumPy itself does not care about the NaN types at all. (Except for some NumPy-specific non-hw-supported data types and platforms.)

Python

Here the story becomes interesting. If the platform supports IEEE floats (most do), Python uses the C library for floating point arithmetics, and thus almost directly hardware instructions in most cases. So there should not be any difference to NumPy.

Except for... There is usually no such thing as a 32-bit float in Python. Python float objects use C double, which is a 64-bit format. How does one transform special NaNs between these formats? In order to see what happens in practice, the following little C code helps:

/* nantest.c - Test floating point nan behaviour with type casts */

#include <stdio.h>
#include <stdint.h>

static uint32_t u1 = 0x7fc00000;
static uint32_t u2 = 0x7f800001;
static uint32_t u3 = 0x7fc00001;

int main(void)
    {
    float f1, f2, f3;
    float f1p, f2p, f3p;
    double d1, d2, d3;
    uint32_t u1p, u2p, u3p;
    uint64_t l1, l2, l3;

    // Convert uint32 -> float
    f1 = *(float *)&u1; f2 = *(float *)&u2; f3 = *(float *)&u3;

    // Convert float -> double (type cast, real conversion)
    d1 = (double)f1; d2 = (double)f2; d3 = (double)f3;

    // Convert the doubles into long ints
    l1 = *(uint64_t *)&d1; l2 = *(uint64_t *)&d2; l3 = *(uint64_t *)&d3;

    // Convert the doubles back to floats
    f1p = (float)d1; f2p = (float)d2; f3p = (float)d3;

    // Convert the floats back to uints
    u1p = *(uint32_t *)&f1p; u2p = *(uint32_t *)&f2p; u3p = *(uint32_t *)&f3p;

    printf("%f (%08x) -> %lf (%016llx) -> %f (%08x)\n", f1, u1, d1, l1, f1p, u1p);
    printf("%f (%08x) -> %lf (%016llx) -> %f (%08x)\n", f2, u2, d2, l2, f2p, u2p);
    printf("%f (%08x) -> %lf (%016llx) -> %f (%08x)\n", f3, u3, d3, l3, f3p, u3p);

    return 0;
    }

This prints:

nan (7fc00000) -> nan (7ff8000000000000) -> nan (7fc00000)
nan (7f800001) -> nan (7ff8000020000000) -> nan (7fc00001)
nan (7fc00001) -> nan (7ff8000020000000) -> nan (7fc00001)

By looking at row 2 it is obvious that we have the same phenomenon as we had with Python. So, it is the conversion to double that introduces the extra is_quiet bit immediately after the exponent in the 64-bit version.

This sounds a bit strange, but actually the standard says (IEEE 754-2008, section 6.2.3):

Conversion of a quiet NaN from a narrower format to a wider format in the same radix, and then back to the same narrower format, should not change the quiet NaN payload in any way except to make it canonical.

This does not say anything about the propagation of signaled NaN's. However, that is explained by section 6.2.1.:

For binary formats, the payload is encoded in the p − 2 least significant bits of the trailing significand field.

The p above is precision, 24 bits for a 32-bit float. So, my mistake was to use signaled NaNs for payload.

Summary

I got the following take home points:

  • the use of qNaNs (quiet NaNs) is supported and encouraged by the IEEE 754-2008
  • odd results were because I tried to use sNaNs and type conversions resulted in the is_quiet bit being set
  • both NumPy and Python act according to IEEE 754 on the most common platforms
  • the implementation leans heavily on the underlying C implementation and thus guarantees very little (there is even some code in Python which acknowledges that NaNs are not handled as they should be on some platforms)
  • the only safe way to handle this is to do a bit of DIY with the payload

There is however, one thing that is implemented in neither Python nor NumPy (nor any other language I have come across with). Section 5.12.1:

Language standards should provide an optional conversion of NaNs in a supported format to external character sequences which appends to the basic NaN character sequences a suffix that can represent the NaN payload (see 6.2). The form and interpretation of the payload suffix is language-defined. The language standard shall require that any such optional output sequences be accepted as input in conversion of external character sequences to supported formats.

like image 145
DrV Avatar answered Nov 04 '22 13:11

DrV