Checking for NaN presence in a container

Tags:

python

equality

python-3.x

nan

containers

NaN is handled perfectly when I check for its presence in a list or a set. But I don't understand how. [UPDATE: no it's not; it is reported as present if the identical instance of NaN is found; if only non-identical instances of NaN are found, it is reported as absent.]

1. I thought presence in a list is tested by equality, so I expected NaN to not be found since NaN != NaN.
2. hash(NaN) and hash(0) are both 0. How do dictionaries and sets tell NaN and 0 apart?
3. Is it safe to check for NaN presence in an arbitrary container using the in operator? Or is it implementation dependent?

My question is about Python 3.2.1; but if there are any changes existing/planned in future versions, I'd like to know that too.

NaN = float('nan')
print(NaN != NaN) # True
print(NaN == NaN) # False

list_ = (1, 2, NaN)
print(NaN in list_) # True; works fine but how?

set_ = {1, 2, NaN}
print(NaN in set_) # True; hash(NaN) is some fixed integer, so no surprise here
print(hash(0)) # 0
print(hash(NaN)) # 0
set_ = {1, 2, 0}
print(NaN in set_) # False; works fine, but how?

Note that if I add an instance of a user-defined class to a list, and then check for containment, the instance's __eq__ method is called (if defined) - at least in CPython. That's why I assumed that list containment is tested using operator ==.

EDIT:

Per Roman's answer, it would seem that __contains__ for list, tuple, set, dict behaves in a very strange way:

def __contains__(self, x):
  for element in self:
    if x is element:
      return True
    if x == element:
      return True
  return False

I say 'strange' because I didn't see it explained in the documentation (maybe I missed it), and I think this is something that shouldn't be left as an implementation choice.

Of course, one NaN object may not be identical (in the sense of id) to another NaN object. (This is not really surprising; Python doesn't guarantee such identity. In fact, I never saw CPython share an instance of NaN created in different places, even though it shares an instance of a small number or a short string.) This means that the result of testing for NaN presence in a built-in container depends on which NaN object happens to be stored, not on its value, so in practice it is unpredictable.

This is very dangerous, and very subtle. Someone might run the very code I showed above, and incorrectly conclude that it's safe to test for NaN membership using in.

I don't think there is a perfect workaround to this issue. One very safe approach is to ensure that NaNs are never added to built-in containers. (It's a pain to check for that all over the code...)

Another alternative is to watch out for cases where in might have NaN on the left side and, in such cases, test for NaN membership separately using math.isnan(). In addition, other operations (e.g., set intersection) also need to be avoided or rewritten.
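The second workaround could look roughly like the sketch below. The helper names contains_nan and safe_contains are made up for illustration, and the sketch only considers plain float NaNs (not Decimal or NumPy values).

```python
import math

def contains_nan(items):
    """Return True if any element of `items` is a float NaN."""
    return any(isinstance(x, float) and math.isnan(x) for x in items)

def safe_contains(items, value):
    """Membership test in which any NaN matches any other NaN (hypothetical helper)."""
    if isinstance(value, float) and math.isnan(value):
        return contains_nan(items)
    return value in items

nan1 = float('nan')
nan2 = float('nan')
data = [1, 2, nan1]

print(nan2 in data)               # False: different object, and nan2 != nan1
print(safe_contains(data, nan2))  # True: found via math.isnan()
print(safe_contains(data, 2))     # True: ordinary values use the normal `in`
```

The NaN branch is a linear scan even for a set or dict, so the hashing speed-up is lost for that case.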


asked Mar 28 '12 09:03

max

2 Answers

Question #1: why is NaN found in a container when it's an identical object?

From the documentation:

For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression x in y is equivalent to any(x is e or x == e for e in y).

This is precisely what I observe with NaN, so everything is fine. Why this rule? I suspect it's because a dict/set wants to honestly report that it contains a certain object if that object is actually in it (even if __eq__() for whatever reason chooses to report that the object is not equal to itself).
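As a quick check, the documented expression can be compared directly with the in operator. This is only a sketch; documented_contains is a name invented here for the any(...) expression quoted above.

```python
nan1 = float('nan')
nan2 = float('nan')
items = [1, 2, nan1]

def documented_contains(x, y):
    # The equivalence quoted from the documentation, spelled out.
    return any(x is e or x == e for e in y)

# Same object: the identity test succeeds even though nan1 != nan1.
print(nan1 in items, documented_contains(nan1, items))  # True True

# Different NaN object: both the identity and the equality test fail.
print(nan2 in items, documented_contains(nan2, items))  # False False
```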

Question #2: why is the hash value for NaN the same as for 0?

From the documentation:

Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. hash() should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.

Note that the requirement is only in one direction; objects that have the same hash do not have to be equal! At first I thought it's a typo, but then I realized that it's not. Hash collisions happen anyway, even with default __hash__() (see an excellent explanation here). The containers handle collisions without any problem. They do, of course, ultimately use the == operator to compare elements, hence they can easily end up with multiple values of NaN, as long as they are not identical! Try this:

>>> nan1 = float('nan')
>>> nan2 = float('nan')
>>> d = {}
>>> d[nan1] = 1
>>> d[nan2] = 2
>>> d[nan1]
1
>>> d[nan2]
2

So everything works as documented. But it's very dangerous! How many people knew that multiple values of NaN could live alongside each other in a dict? How many people would find this easy to debug?
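The collision handling described above can also be seen without NaN. In the sketch below, SameHash is a made-up class whose hash deliberately collides with hash(0); the set still tells the objects apart because it falls back to identity and == after the hashes match.

```python
class SameHash:
    """Illustrative class: every instance hashes to 0, like the integer 0."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return 0  # deliberate collision with hash(0)

    def __eq__(self, other):
        return isinstance(other, SameHash) and self.name == other.name

a = SameHash('a')
b = SameHash('b')
s = {0, a}

print(hash(a) == hash(0))   # True: the hashes collide
print(len(s))               # 2: equal hashes do not merge unequal objects
print(a in s)               # True
print(b in s)               # False: same hash, but not equal to 0 or to a
print(SameHash('a') in s)   # True: a different object that compares equal to a
```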

I would recommend making NaN an instance of a subclass of float that doesn't support hashing and hence cannot be accidentally added to a set/dict. I'll submit this to python-ideas.
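A rough sketch of that idea follows. UnhashableNaN is a hypothetical class written for this post, not something Python provides; it relies on the fact that setting __hash__ to None makes instances unhashable.

```python
import math

class UnhashableNaN(float):
    """Hypothetical NaN type that refuses to be hashed."""

    __hash__ = None  # instances cannot be used as set members or dict keys

    def __new__(cls):
        return super().__new__(cls, 'nan')

nan = UnhashableNaN()
print(math.isnan(nan))  # True: it still behaves like a float NaN

try:
    {nan}
except TypeError as exc:
    print('rejected:', exc)

# Lists and tuples do not hash their elements, so they still accept it.
print(len([nan]))  # 1
```

Note that this only protects NaN values created through the subclass. A NaN produced by arithmetic (for example nan + 1) or by float('nan') is still a plain, hashable float.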

Finally, I found a mistake in the documentation here:

For user-defined classes which do not define __contains__() but do define __iter__(), x in y is true if some value z with x == z is produced while iterating over y. If an exception is raised during the iteration, it is as if in raised that exception.

Lastly, the old-style iteration protocol is tried: if a class defines __getitem__(), x in y is true if and only if there is a non-negative integer index i such that x == y[i], and all lower integer indices do not raise IndexError exception. (If any other exception is raised, it is as if in raised that exception).

You may notice that there is no mention of is here, unlike with built-in containers. I was surprised by this, so I tried:

>>> nan1 = float('nan')
>>> nan2 = float('nan')
>>> class Cont:
...   def __iter__(self):
...     yield nan1
...
>>> c = Cont()
>>> nan1 in c
True
>>> nan2 in c
False

As you can see, the identity is checked first, before == - consistent with the built-in containers. I'll submit a report to fix the docs.
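The same experiment can be repeated for the old-style __getitem__ protocol mentioned in the quoted passage. The sketch below uses a made-up class OldStyle; in CPython, as far as I can tell, the fallback path applies the same identity shortcut.

```python
nan1 = float('nan')
nan2 = float('nan')

class OldStyle:
    """Illustrative container that defines only __getitem__."""

    def __getitem__(self, index):
        if index == 0:
            return nan1
        raise IndexError(index)

c = OldStyle()
print(nan1 in c)  # identical object is stored
print(nan2 in c)  # a different NaN object
```

On CPython I would expect this to print True and then False, matching the __iter__ case above.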

answered Oct 17 '22 07:10

max

I can't reproduce your tuple/set cases using float('nan') instead of NaN.

So I assume that it worked only because id(NaN) == id(NaN), i.e. the very same object was on both sides of the lookup; there is no interning for NaN objects:

>>> NaN = float('NaN')
>>> id(NaN)
34373956456
>>> id(float('NaN'))
34373956480

And

>>> NaN is NaN
True
>>> NaN is float('NaN')
False

I believe tuple/set lookups have some optimization related to comparison of the same objects.

Answering your question: it seems to be unsafe to rely on the in operator when checking for the presence of NaN. I'd recommend using None instead, if possible.

Just a comment. __eq__ has nothing to do with the is operator, and during lookups the comparison of objects' ids seems to happen prior to any value comparisons:

>>> class A(object):
...     def __eq__(*args):
...             print('__eq__')
...
>>> A() == A()
__eq__          # as expected
>>> A() is A()
False           # `is` checks only ids
>>> A() in [A()]
__eq__          # as expected
False
>>> a = A()
>>> a in [a]
True            # surprise!

answered Oct 17 '22 09:10

Roman Bodnarchuk
