Python: sort function breaks in the presence of nan

sorted([2, float('nan'), 1]) returns [2, nan, 1]

(At least on the ActiveState Python 3.1 implementation.)

I understand nan is a weird object, so I wouldn't be surprised if it shows up in random places in the sort result. But it also messes up the sort for the non-nan numbers in the container, which is really unexpected.

I asked a related question about max, and based on that I understand why sort works like this. But should this be considered a bug?

The documentation just says "Return a new sorted list [...]" without specifying any details.

EDIT: I now agree that this isn't in violation of the IEEE standard. However, it's a bug from any common sense viewpoint, I think. Even Microsoft, which isn't known to admit their mistakes often, has recognized this one as a bug, and fixed it in the latest version: http://connect.microsoft.com/VisualStudio/feedback/details/363379/bug-in-list-double-sort-in-list-which-contains-double-nan.

Anyway, I ended up following @khachik's answer:

sorted(list_, key=lambda x: float('-inf') if math.isnan(x) else x)

I suspect it results in a performance hit compared to the language doing that by default, but at least it works (barring any bugs that I introduced).
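For anyone who wants to try the key-function workaround, here is a minimal runnable sketch. The helper name nan_first_key and the sample values are my own, not from the answer I followed. Note that with this key a NaN ties with a genuine -inf, so their relative order falls back to their original positions (the sort is stable).

```python
import math


def nan_first_key(x):
    """Hypothetical helper: treat NaN as if it were negative infinity."""
    return float("-inf") if math.isnan(x) else x


values = [2, float("nan"), 1, float("-inf"), 3.5]

# The key never returns NaN, so "<" is a valid ordering over the keys.
ordered = sorted(values, key=nan_first_key)
print(ordered)  # [nan, -inf, 1, 2, 3.5]
```

The NaN comes first here only because it appears before the real -inf in the input; if that tie matters, a tuple key or a comparison function would be needed instead.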

asked Nov 21 '10 20:11

max

1 Answer

The previous answers are useful, but perhaps not clear regarding the root of the problem.

In any language, sort applies a given ordering, defined by a comparison function or in some other way, over the domain of the input values. For example, less-than, a.k.a. operator <, could be used throughout if and only if less than defines a suitable ordering over the input values.

But this is specifically NOT true for floating point values and less-than: "NaN is unordered: it is not equal to, greater than, or less than anything, including itself." (Clear prose from the GNU C Library manual, but it applies to all modern IEEE 754 based floating point.)
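A quick way to see this in Python (a small sketch; the variable names are mine): every comparison involving NaN comes back False, so a sort that only asks "is a < b?" never learns that anything is out of order around the NaN.

```python
import math

nan = float("nan")

# Each of these is False: NaN is neither less than, greater than,
# nor equal to anything, including itself.
comparisons = [nan < 1.0, nan > 1.0, nan == nan, 1.0 < nan]
print(comparisons)  # [False, False, False, False]

# sorted() relies on "<", so the NaN hides the fact that 2 and 1 are
# out of order. The question reports [2, nan, 1] for this input.
result = sorted([2, nan, 1])
print(result)
```

The exact position of the NaN and its neighbours depends on the sort implementation and the input order, which is why the result looks arbitrary.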

So the possible solutions are:

remove the NaNs first, making the input domain well defined via < (or the other sorting function being used)

define a custom comparison function (a.k.a. predicate) that does define an ordering for NaN, such as less than any number, or greater than any number.

Either approach can be used, in any language.

Practically, in Python, I would remove the NaNs first if either top performance is not a major concern or dropping the NaNs is the desired behavior in context.
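Removing the NaNs first might look like the sketch below. The function name sorted_without_nans is just an illustrative choice, and it assumes the input contains only real numbers (ints or floats).

```python
import math


def sorted_without_nans(values):
    """Hypothetical helper: drop every NaN, then sort what is left."""
    return sorted(x for x in values if not math.isnan(x))


cleaned = sorted_without_nans([2, float("nan"), 1, float("nan"), 0.5])
print(cleaned)  # [0.5, 1, 2]
```

Once the NaNs are gone, "<" is a proper ordering over the remaining values, so the plain sorted() call is reliable.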

Otherwise you could supply a suitable comparison function, passed directly as the "cmp" argument in older Python versions (Python 2), or wrapped with functools.cmp_to_key() in versions where "cmp" is no longer available. The latter is naturally a bit more awkward than removing the NaNs first, and the comparison function needs to be written with care to avoid hurting performance.
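As a rough sketch of the comparison-function route with functools.cmp_to_key(): the function compare_nan_last below is a hypothetical example of mine that places NaN after every number and treats two NaNs as equal.

```python
import math
from functools import cmp_to_key


def compare_nan_last(a, b):
    """Hypothetical comparison: NaN sorts after every number; NaNs tie."""
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan or b_nan:
        # bool arithmetic: 1 if only a is NaN, -1 if only b is, 0 if both
        return a_nan - b_nan
    # Standard three-way comparison for ordinary numbers
    return (a > b) - (a < b)


values = [2, float("nan"), 1, float("inf"), -3]
ordered = sorted(values, key=cmp_to_key(compare_nan_last))
print(ordered)  # [-3, 1, 2, inf, nan]
```

Because the comparison runs as Python code for every pair the sort inspects, this is typically slower than a simple key function or than filtering the NaNs out beforehand.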

answered Oct 05 '22 23:10

Bob Davis

Tags: python, sorting, math, nan
