Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why adding multiple 'nan' in python dictionary giving multiple entries?

Example problem:

import numpy as np
dc = dict()
dc[np.float('nan')] = 100
dc[np.float('nan')] = 200

It is creating multiple entries for nan like

dc.keys() will produce {nan: 100, nan: 200} but it should create {nan: 200}.

like image 505
durjoy Avatar asked Jul 25 '17 10:07

durjoy


People also ask

Can dictionary can have multiple values Python?

In python, if we want a dictionary in which one key has multiple values, then we need to associate an object with each key as value. This value object should be capable of having various values inside it. We can either use a tuple or a list as a value in the dictionary to associate multiple values with a key.

What happens if you use the same key multiple times while creating a dictionary?

No, each key in a dictionary should be unique. You can't have two keys with the same value. Attempting to use the same key again will just overwrite the previous value stored. If a key needs to store multiple values, then the value associated with the key should be a list or another dictionary.

What is NaN in dictionary Python?

NaN stands for Not A Number and is one of the common ways to represent the missing value in the data. It is a special floating-point value and cannot be converted to any other type than float.

How do you add multiple values to a dictionary in Python?

Method : Using isinstance() + update() + extend() The combination of above functions can be used to perform this task. In this, we check for the data type of value using isinstance() and perform dictionary update using update(), list update using extend().


2 Answers

The short answer to your question (of why adding NaN keys to a Python dict create multiple entries), is because floating-point NaN values are unordered, i.e. a NaN value is not equal to, greater than, or less than anything, including itself. This behavior is defined in the IEEE 754 standard for floating point arithmetic. The explanation why is that so is given by an IEEE 754 committee member in this answer.


For a longer, Python-specific, answer, let's first have a look at how item insertion and key comparison work in CPython dictionaries.

When you say d[key] = val, PyDict_SetItem() for dictionary d is called, which in turn calls (internal) insertdict(), which will either update the existing dictionary item, or insert a new item (maybe resizing the hash table consequentially).

The first step on insert is to lookup the key in the hash table of dictionary keys. A general-purpose lookup function that gets called in your case (of non-string keys) is lookdict().

lookdict will use key's hash value to locate the key, iterating over a list of possible keys with identical hash value, comparing first by address, then by calling keys' equivalence operator(s) (see excellent comments in Objects/dictobject.c for more details on hash collision resolution in Python's implementation of open addressing).

Since every float('nan') has the same hash value, yet each one is a different object (with a different "identity", i.e. memory address), and they're not equal by their float-value:

>>> a, b = float('nan'), float('nan')
>>> hash(a), hash(b)
(0, 0)
>>> id(a), id(b)
(94753433907296, 94753433907272)
>>> a == b
False

when you say:

d = dict()
d[float('nan')] = 1
d[float('nan')] = 2

lookdict will search for the second NaN by looking at its hash (0), then trying to resolve hash collision by iterating over keys with the same hash and comparing the keys by identity/address (they are different), then by invoking (the expensive) PyObject_RichCompareBool/do_richcompare, which in turn calls float_richcompare which compares floats just as C does:

/* Comparison is pretty much a nightmare.  When comparing float to float,
 * we do it as straightforwardly (and long-windedly) as conceivable, so
 * that, e.g., Python x == y delivers the same result as the platform
 * C x == y when x and/or y is a NaN.

which behaves according to IEEE 754 standard (from GNU C library docs):

20.5.2 Infinity and NaN

[...]

The basic operations and math functions all accept infinity and NaN and produce sensible output. Infinities propagate through calculations as one would expect: for example, 2 + ∞ = ∞, 4/∞ = 0, atan (∞) = π/2. NaN, on the other hand, infects any calculation that involves it. Unless the calculation would produce the same result no matter what real value replaced NaN, the result is NaN.

In comparison operations, positive infinity is larger than all values except itself and NaN, and negative infinity is smaller than all values except itself and NaN. NaN is unordered: it is not equal to, greater than, or less than anything, including itself. x == x is false if the value of x is NaN. You can use this to test whether a value is NaN or not, but the recommended way to test for NaN is with the isnan function (see Floating Point Classes). In addition, <, >, <=, and >= will raise an exception when applied to NaNs.

and which will return false for NaN == NaN.

That's why Python decides the second NaN object is worthy of a new dictionary entry. It may have the same hash, but its address and equivalence test say it is different from all the other NaN objects.

However, note that if you always use the same NaN object (with the same address) since the address is tested before float equivalence, you'll get the expected behavior:

>>> nan = float('nan')
>>> d = dict()
>>> d[nan] = 1
>>> d[nan] = 2
>>> d
{nan: 2}
like image 163
randomir Avatar answered Nov 08 '22 18:11

randomir


For historical reasons explained here, np.float('nan') == np.float('nan') is False. The rule is just that you can't have two dictionary keys which are equal to each other - so you can have two keys equal to np.float('nan').

Of course, this behavior is counterintuitive and surprising - so you should avoid using np.float('nan') as a key.

like image 29
perigon Avatar answered Nov 08 '22 16:11

perigon