I'm having a difficult time debugging a problem in which the float nan in a list and nan in a numpy.array are handled differently when they are used in itertools.groupby:
Given the following list and array:
from itertools import groupby
import numpy as np
lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
arr = np.array(lst)
When I iterate over the list, the contiguous nans are grouped:
>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan, nan, nan] <class 'float'>
nan [nan] <class 'float'>
However, if I use the array, it puts successive nans in different groups:
>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
Even if I convert the array back to a list:
>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>
I'm using:
numpy 1.11.3
python 3.5
I know that generally nan != nan, so why do these operations give different results? And how is it possible that groupby can group nans at all?
Python lists are just arrays of pointers to objects in memory. In particular, lst holds pointers to the object np.nan:
>>> [id(x) for x in lst]
[139832272211880, # nan
139832272211880, # nan
139832272211880, # nan
139832133974296,
139832270325408,
139832133974296,
139832133974464,
139832133974320,
139832133974296,
139832133974440,
139832272211880, # nan
139832133974296]
(np.nan is at 139832272211880 on my computer.)
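Equivalently, a direct identity check (my own addition, not in the original post) confirms that every nan entry in the list is the very same object:
>>> lst[0] is lst[1] is lst[2] is np.nan  # one shared float object
True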
On the other hand, NumPy arrays are just contiguous regions of memory; they are regions of bits and bytes that are interpreted as a sequence of values (floats, ints, etc.) by NumPy.
The trouble is that when you ask Python to iterate over a NumPy array holding floating-point values (in a for loop or at the groupby level), Python needs to box those bytes into proper Python objects. It creates a brand new Python object in memory for each value in the array as it iterates.
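You can observe this boxing directly (a quick check of my own): indexing the same array element twice produces two distinct scalar objects, so even an identity test on "the same" element fails:
>>> arr[0] is arr[0]  # each access boxes a fresh numpy.float64
False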
For example, you can see that distinct objects for each nan value are created when .tolist() is called:
>>> [id(x) for x in arr.tolist()]
[4355054616, # nan
4355054640, # nan
4355054664, # nan
4355054688,
4355054712,
4355054736,
4355054760,
4355054784,
4355054808,
4355054832,
4355054856, # nan
4355054880]
itertools.groupby is able to group on np.nan for the Python list because it checks for identity first when it compares Python objects. Because these pointers to nan all point at the same np.nan object, grouping is possible.
However, iteration over the NumPy array does not allow this initial identity check to succeed, so Python falls back to checking for equality, and nan != nan as you say.
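If you need the array's nans to group the way the list's do, one workaround (a sketch of my own, not part of the original answer) is to pass a key function that maps every nan to a single shared sentinel such as None, so that consecutive nans produce keys that compare equal by identity:
>>> for key, group in groupby(arr, key=lambda x: None if np.isnan(x) else x):
...     if key is None:
...         print(key, list(group))
None [nan, nan, nan]
None [nan]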
The answers of tobias_k and ajcr are correct: it's because the nans in the list have the same id, while they get different ids when they are "iterated over" in the NumPy array. This answer is meant as a supplement to those answers.
>>> from itertools import groupby
>>> import numpy as np
>>> lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
>>> arr = np.array(lst)
>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274500321192 [1274500321192, 1274500321192, 1274500321192]
nan 1274500321192 [1274500321192]
>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130480 [1274537130480]
nan 1274537130504 [1274537130504]
nan 1274537130480 [1274537130480]
nan 1274537130480 [1274537130480] # same id as before but these are not consecutive
>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130336 [1274537130336]
nan 1274537130408 [1274537130408]
nan 1274500320904 [1274500320904]
nan 1274537130168 [1274537130168]
The problem is that Python normally uses the PyObject_RichCompare operation when comparing values, which only falls back to an object-identity test if == fails because it is not implemented. itertools.groupby, on the other hand, uses PyObject_RichCompareBool (see Source: 1, 2), which tests for object identity first, before == is tried.
This can be verified with a small cython snippet:
%load_ext cython

%%cython
from cpython.object cimport PyObject_RichCompareBool, PyObject_RichCompare, Py_EQ

def compare(a, b):
    return PyObject_RichCompare(a, b, Py_EQ), PyObject_RichCompareBool(a, b, Py_EQ)
>>> compare(np.nan, np.nan)
(False, True)
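You don't even need Cython to see the asymmetry (a pure-Python check of my own): container membership tests such as in also go through PyObject_RichCompareBool, so the identity shortcut fires there too:
>>> np.nan == np.nan                 # PyObject_RichCompare: ordinary float equality
False
>>> np.nan in [np.nan]               # same object, so the identity shortcut says True
True
>>> float('nan') in [float('nan')]   # two distinct nan objects: no shortcut
False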
The source code for PyObject_RichCompareBool reads like this:
/* Perform a rich comparison with integer result.  This wraps
   PyObject_RichCompare(), returning -1 for error, 0 for false, 1 for true. */
int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
    PyObject *res;
    int ok;

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    /**********************That's the difference!****************/
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

    res = PyObject_RichCompare(v, w, op);
    if (res == NULL)
        return -1;
    if (PyBool_Check(res))
        ok = (res == Py_True);
    else
        ok = PyObject_IsTrue(res);
    Py_DECREF(res);
    return ok;
}
The object identity test (if (v == w)) is indeed done before the normal Python comparison PyObject_RichCompare(v, w, op) is attempted, and this is mentioned in the documentation:
Note: If o1 and o2 are the same object, PyObject_RichCompareBool() will always return 1 for Py_EQ and 0 for Py_NE.
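That guarantee is easy to demonstrate from pure Python (an illustration of my own, not from the docs): even a class whose __eq__ always returns False is "equal" to itself wherever PyObject_RichCompareBool is used, for example in list membership tests:
>>> class NeverEqual:
...     def __eq__(self, other):
...         return False
...
>>> x = NeverEqual()
>>> x == x            # PyObject_RichCompare calls __eq__
False
>>> x in [x]          # PyObject_RichCompareBool: identity implies equality
True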
I am not sure whether this is the reason, but I just noticed this about the nans in lst and arr:
>>> lst[0] == lst[1], arr[0] == arr[1]
(False, False)
>>> lst[0] is lst[1], arr[0] is arr[1]
(True, False)
I.e., while all nans are unequal, the regular np.nan values (of type float) are all the same instance, while the nans in arr are different instances (of type numpy.float64). So my guess would be that if no key function is given, groupby will test for identity before doing the more expensive equality check.
This is also consistent with the observation that it does not group in arr.tolist() either, because even though those nans are now float again, they are no longer the same instance.
>>> atl = arr.tolist()
>>> atl[0] is atl[1]
False