Example problem:
import numpy as np
dc = dict()
dc[np.float('nan')] = 100
dc[np.float('nan')] = 200
It is creating multiple entries for nan
like
dc.keys()
will produce {nan: 100, nan: 200}
but it should create {nan: 200}
.
In python, if we want a dictionary in which one key has multiple values, then we need to associate an object with each key as value. This value object should be capable of having various values inside it. We can either use a tuple or a list as a value in the dictionary to associate multiple values with a key.
No, each key in a dictionary should be unique. You can't have two keys with the same value. Attempting to use the same key again will just overwrite the previous value stored. If a key needs to store multiple values, then the value associated with the key should be a list or another dictionary.
NaN stands for Not A Number and is one of the common ways to represent the missing value in the data. It is a special floating-point value and cannot be converted to any other type than float.
Method : Using isinstance() + update() + extend() The combination of above functions can be used to perform this task. In this, we check for the data type of value using isinstance() and perform dictionary update using update(), list update using extend().
The short answer to your question (of why adding NaN
keys to a Python dict
create multiple entries), is because floating-point NaN
values are unordered, i.e. a NaN
value is not equal to, greater than, or less than anything, including itself. This behavior is defined in the IEEE 754 standard for floating point arithmetic. The explanation why is that so is given by an IEEE 754 committee member in this answer.
For a longer, Python-specific, answer, let's first have a look at how item insertion and key comparison work in CPython dictionaries.
When you say d[key] = val
, PyDict_SetItem()
for dictionary d
is called, which in turn calls (internal) insertdict()
, which will either update the existing dictionary item, or insert a new item (maybe resizing the hash table consequentially).
The first step on insert is to lookup the key
in the hash table of dictionary keys. A general-purpose lookup function that gets called in your case (of non-string keys) is lookdict()
.
lookdict
will use key
's hash value to locate the key
, iterating over a list of possible keys with identical hash value, comparing first by address, then by calling key
s' equivalence operator(s) (see excellent comments in Objects/dictobject.c
for more details on hash collision resolution in Python's implementation of open addressing).
Since every float('nan')
has the same hash value, yet each one is a different object (with a different "identity", i.e. memory address), and they're not equal by their float-value:
>>> a, b = float('nan'), float('nan')
>>> hash(a), hash(b)
(0, 0)
>>> id(a), id(b)
(94753433907296, 94753433907272)
>>> a == b
False
when you say:
d = dict()
d[float('nan')] = 1
d[float('nan')] = 2
lookdict
will search for the second NaN
by looking at its hash (0
), then trying to resolve hash collision by iterating over keys with the same hash and comparing the keys by identity/address (they are different), then by invoking (the expensive) PyObject_RichCompareBool
/do_richcompare
, which in turn calls float_richcompare
which compares floats just as C does:
/* Comparison is pretty much a nightmare. When comparing float to float,
* we do it as straightforwardly (and long-windedly) as conceivable, so
* that, e.g., Python x == y delivers the same result as the platform
* C x == y when x and/or y is a NaN.
which behaves according to IEEE 754 standard (from GNU C library docs):
20.5.2 Infinity and NaN
[...]
The basic operations and math functions all accept infinity and NaN and produce sensible output. Infinities propagate through calculations as one would expect: for example, 2 + ∞ = ∞, 4/∞ = 0, atan (∞) = π/2. NaN, on the other hand, infects any calculation that involves it. Unless the calculation would produce the same result no matter what real value replaced NaN, the result is NaN.
In comparison operations, positive infinity is larger than all values except itself and NaN, and negative infinity is smaller than all values except itself and NaN. NaN is unordered: it is not equal to, greater than, or less than anything, including itself. x == x is false if the value of x is NaN. You can use this to test whether a value is NaN or not, but the recommended way to test for NaN is with the isnan function (see Floating Point Classes). In addition, <, >, <=, and >= will raise an exception when applied to NaNs.
and which will return false
for NaN == NaN
.
That's why Python decides the second NaN
object is worthy of a new dictionary entry. It may have the same hash, but its address and equivalence test say it is different from all the other NaN
objects.
However, note that if you always use the same NaN
object (with the same address) since the address is tested before float equivalence, you'll get the expected behavior:
>>> nan = float('nan')
>>> d = dict()
>>> d[nan] = 1
>>> d[nan] = 2
>>> d
{nan: 2}
For historical reasons explained here,
np.float('nan') == np.float('nan')
is False. The rule is just that you can't have two dictionary keys which are equal to each other - so you can have two keys equal to np.float('nan')
.
Of course, this behavior is counterintuitive and surprising - so you should avoid using np.float('nan')
as a key.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With