
Why can itertools.groupby group the NaNs in lists but not in numpy arrays

I'm having a hard time debugging a problem in which a float nan in a list and a nan in a numpy.array are handled differently when used with itertools.groupby:

Given the following list and array:

from itertools import groupby
import numpy as np

lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
arr = np.array(lst)

When I iterate over the list the contiguous nans are grouped:

>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan, nan, nan] <class 'float'>
nan [nan] <class 'float'>

However if I use the array it puts successive nans in different groups:

>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>

Even if I convert the array back to a list:

>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>

I'm using:

numpy 1.11.3
python 3.5

I know that generally nan != nan so why do these operations give different results? And how is it possible that groupby can group nans at all?

asked Jan 18 '17 by MSeifert



3 Answers

Python lists are just arrays of pointers to objects in memory. In particular lst holds pointers to the object np.nan:

>>> [id(x) for x in lst]
[139832272211880, # nan
 139832272211880, # nan
 139832272211880, # nan
 139832133974296,
 139832270325408,
 139832133974296,
 139832133974464,
 139832133974320,
 139832133974296,
 139832133974440,
 139832272211880, # nan
 139832133974296]

(np.nan is at 139832272211880 on my computer.)
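This sharing is easy to verify: np.nan is a single module-level float constant, so every reference to it is the same object, whereas each call to float('nan') builds a fresh one:

```python
import numpy as np

# np.nan is one module-level float object; every reference to it points
# at the same address, whereas float('nan') creates a new object per call:
a = np.nan
b = np.nan
print(a is b)                          # True: same object
print(float('nan') is float('nan'))    # False: two freshly created nans
```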

On the other hand, NumPy arrays are just contiguous regions of memory; they are regions of bits and bytes that are interpreted as a sequence of values (floats, ints, etc.) by NumPy.

The trouble is that when you ask Python to iterate over a NumPy array of floats (in a for loop, or inside groupby), Python has to box those bytes into proper Python objects. It creates a brand new Python object for every single value in the array as the iteration proceeds.
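The boxing happens on every access, which you can see even without iterating; two reads of the very same element produce two distinct objects:

```python
import numpy as np

arr = np.array([np.nan, np.nan])

# Indexing boxes the raw bytes into a brand new numpy.float64 object on
# every access, so even two reads of the *same* element are distinct:
print(arr[0] is arr[0])   # False: two distinct boxed objects
print(arr[0] == arr[0])   # False: nan != nan
```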

For example, you can see that distinct objects are created for each nan value when .tolist() is called:

>>> [id(x) for x in arr.tolist()]
[4355054616, # nan
 4355054640, # nan
 4355054664, # nan
 4355054688,
 4355054712,
 4355054736,
 4355054760,
 4355054784,
 4355054808,
 4355054832,
 4355054856, # nan
 4355054880]

itertools.groupby is able to group on np.nan for the Python list because it checks for identity first when it compares Python objects. Because these pointers to nan all point at the same np.nan object, grouping is possible.
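This identity shortcut can be demonstrated without nan at all. NeverEqual below is a made-up class for illustration, not something from the question: its instances never compare equal, yet groupby still groups repeated references to the same instance because the identity check fires before __eq__ is ever called:

```python
from itertools import groupby

class NeverEqual:
    """Hypothetical class whose instances never compare equal."""
    def __eq__(self, other):
        return False

sentinel = NeverEqual()
items = [sentinel, sentinel, sentinel]

# groupby puts all three in one group: the identity check inside the
# comparison short-circuits to "equal" before __eq__ is consulted.
groups = [(key, len(list(grp))) for key, grp in groupby(items)]
print(len(groups))    # 1
print(groups[0][1])   # 3
```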

However, iteration over the NumPy array boxes each value into a fresh object, so this initial identity check never succeeds; Python falls back to checking equality, and nan != nan, as you say.

answered Oct 18 '22 by Alex Riley

The answers of tobias_k and ajcr are correct: the nans in the list all share the same id, while each nan produced by "iterating over" the numpy array gets a different id.

This answer is meant as a supplement for these answers.

>>> from itertools import groupby
>>> import numpy as np

>>> lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
>>> arr = np.array(lst)

>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274500321192 [1274500321192, 1274500321192, 1274500321192]
nan 1274500321192 [1274500321192]

>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130480 [1274537130480]
nan 1274537130504 [1274537130504]
nan 1274537130480 [1274537130480]
nan 1274537130480 [1274537130480]  # same id as before but these are not consecutive

>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130336 [1274537130336]
nan 1274537130408 [1274537130408]
nan 1274500320904 [1274500320904]
nan 1274537130168 [1274537130168]

The problem is that a plain == comparison goes through the PyObject_RichCompare operation, which only falls back to an identity test when neither operand implements the comparison. itertools.groupby, on the other hand, uses PyObject_RichCompareBool (see Source: 1, 2), which tests for object identity first, before == is evaluated.

This can be verified with a small Cython snippet:

%load_ext cython
%%cython

from cpython.object cimport PyObject_RichCompareBool, PyObject_RichCompare, Py_EQ

def compare(a, b):
    return PyObject_RichCompare(a, b, Py_EQ), PyObject_RichCompareBool(a, b, Py_EQ)

>>> compare(np.nan, np.nan)
(False, True)
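The same shortcut is visible in pure Python wherever PyObject_RichCompareBool is used under the hood, for example in the `in` operator and `list.count`:

```python
import math

nan = math.nan

print(nan == nan)              # False: == goes through PyObject_RichCompare
print(nan in [nan])            # True: 'in' uses PyObject_RichCompareBool,
                               # which short-circuits on identity
print([nan, nan].count(nan))   # 2: count() takes the same shortcut
```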

The source code for PyObject_RichCompareBool reads like this:

/* Perform a rich comparison with integer result.  This wraps
   PyObject_RichCompare(), returning -1 for error, 0 for false, 1 for true. */
int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
    PyObject *res;
    int ok;

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    /**********************That's the difference!****************/
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

    res = PyObject_RichCompare(v, w, op);
    if (res == NULL)
        return -1;
    if (PyBool_Check(res))
        ok = (res == Py_True);
    else
        ok = PyObject_IsTrue(res);
    Py_DECREF(res);
    return ok;
}

The object identity test (if (v == w)) is indeed performed before the normal Python comparison PyObject_RichCompare(v, w, op) is invoked, and this behavior is mentioned in the documentation:

Note:

If o1 and o2 are the same object, PyObject_RichCompareBool() will always return 1 for Py_EQ and 0 for Py_NE.

answered Oct 18 '22 by MSeifert


I am not sure whether this is the reason, but I just noticed this about the nan in lst and arr:

>>> lst[0] == lst[1], arr[0] == arr[1]
(False, False)
>>> lst[0] is lst[1], arr[0] is arr[1]
(True, False)

I.e., while all nans are unequal, the plain np.nan values (of type float) are all the same instance, whereas the nans in arr are distinct instances of type numpy.float64. So my guess would be that when no key function is given, groupby tests for identity before doing the more expensive equality check.

This is also consistent with the observation that it does not group in arr.tolist() either: even though those nans are float again, they are no longer the same instance.

>>> atl = arr.tolist()
>>> atl[0] is atl[1]
False
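If you actually need consecutive nans grouped in an array, one possible workaround (a sketch of my own, not from any of the answers) is a key function that maps every nan to a single shared sentinel, so the keys compare equal by identity:

```python
from itertools import groupby
import numpy as np

arr = np.array([np.nan, np.nan, np.nan, 0.16, 1, 0.16, np.nan, np.nan])

def nan_key(x):
    # Map every nan to the one shared None object; identity then makes
    # consecutive nan keys compare equal. Non-nan values compare by value.
    return None if np.isnan(x) else x

groups = [(key, len(list(grp))) for key, grp in groupby(arr, key=nan_key)]
print(groups)   # nans grouped: group sizes are [3, 1, 1, 1, 2]
```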

answered Oct 18 '22 by tobias_k