I'm having a difficult time debugging a problem in which the float nan in a list and nan in a numpy.array are handled differently when they are used in itertools.groupby:
Given the following list and array:
from itertools import groupby
import numpy as np
lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
arr = np.array(lst)
When I iterate over the list, the contiguous nans are grouped:
>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan, nan, nan] <class 'float'>
nan [nan] <class 'float'>
However, if I use the array, it puts successive nans in different groups:
>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
Even if I convert the array back to a list:
>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>
I'm using:
numpy 1.11.3
python 3.5
I know that generally nan != nan, so why do these operations give different results? And how is it possible that groupby can group nans at all?
Python lists are just arrays of pointers to objects in memory. In particular, lst holds pointers to the object np.nan:
>>> [id(x) for x in lst]
[139832272211880, # nan
139832272211880, # nan
139832272211880, # nan
139832133974296,
139832270325408,
139832133974296,
139832133974464,
139832133974320,
139832133974296,
139832133974440,
139832272211880, # nan
139832133974296]
(np.nan is at 139832272211880 on my computer.)
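Equivalently, a direct identity check (my own addition, not in the original post) confirms that every nan entry in the list is the very same object:
>>> lst[0] is lst[1] is lst[2] is np.nan  # one shared float object
True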
On the other hand, NumPy arrays are just contiguous regions of memory; they are regions of bits and bytes that are interpreted as a sequence of values (floats, ints, etc.) by NumPy.
The trouble is that when you ask Python to iterate over a NumPy array holding floating-point values (in a for loop or at the groupby level), Python needs to box those bytes into proper Python objects. It creates a brand new Python object in memory for each value in the array as it iterates.
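You can observe this boxing directly (a quick check of my own): indexing the same array element twice produces two distinct scalar objects, so even an identity test on "the same" element fails:
>>> arr[0] is arr[0]  # each access boxes a fresh numpy.float64
False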
For example, you can see that distinct objects for each nan value are created when .tolist() is called:
>>> [id(x) for x in arr.tolist()]
[4355054616, # nan
4355054640, # nan
4355054664, # nan
4355054688,
4355054712,
4355054736,
4355054760,
4355054784,
4355054808,
4355054832,
4355054856, # nan
4355054880]
itertools.groupby is able to group on np.nan for the Python list because it checks for identity first when it compares Python objects. Because these pointers to nan all point at the same np.nan object, grouping is possible.
However, iteration over the NumPy array does not allow this initial identity check to succeed, so Python falls back to checking for equality, and nan != nan as you say.
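If you need the array's nans to group the way the list's do, one workaround (a sketch of my own, not part of the original answer) is to pass a key function that maps every nan to a single shared sentinel such as None, so that consecutive nans produce keys that compare equal by identity:
>>> for key, group in groupby(arr, key=lambda x: None if np.isnan(x) else x):
...     if key is None:
...         print(key, list(group))
None [nan, nan, nan]
None [nan]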
The answers of tobias_k and ajcr are correct: it's because the nans in the list have the same id, while they get different ids when they are "iterated over" in the NumPy array. This answer is meant as a supplement to those answers.
>>> from itertools import groupby
>>> import numpy as np
>>> lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
>>> arr = np.array(lst)
>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274500321192 [1274500321192, 1274500321192, 1274500321192]
nan 1274500321192 [1274500321192]
>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130480 [1274537130480]
nan 1274537130504 [1274537130504]
nan 1274537130480 [1274537130480]
nan 1274537130480 [1274537130480] # same id as before but these are not consecutive
>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130336 [1274537130336]
nan 1274537130408 [1274537130408]
nan 1274500320904 [1274500320904]
nan 1274537130168 [1274537130168]
The problem is that Python normally uses the PyObject_RichCompare operation when comparing values, which only falls back to an object-identity test if == fails because it is not implemented. itertools.groupby, on the other hand, uses PyObject_RichCompareBool (see Source: 1, 2), which tests for object identity first, before == is tried.
This can be verified with a small cython snippet:
%load_ext cython

%%cython
from cpython.object cimport PyObject_RichCompareBool, PyObject_RichCompare, Py_EQ

def compare(a, b):
    return PyObject_RichCompare(a, b, Py_EQ), PyObject_RichCompareBool(a, b, Py_EQ)
>>> compare(np.nan, np.nan)
(False, True)
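You don't even need Cython to see the asymmetry (a pure-Python check of my own): container membership tests such as in also go through PyObject_RichCompareBool, so the identity shortcut fires there too:
>>> np.nan == np.nan                 # PyObject_RichCompare: ordinary float equality
False
>>> np.nan in [np.nan]               # same object, so the identity shortcut says True
True
>>> float('nan') in [float('nan')]   # two distinct nan objects: no shortcut
False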
The source code for PyObject_RichCompareBool reads like this:
/* Perform a rich comparison with integer result.  This wraps
   PyObject_RichCompare(), returning -1 for error, 0 for false, 1 for true. */
int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
    PyObject *res;
    int ok;

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    /**********************That's the difference!****************/
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

    res = PyObject_RichCompare(v, w, op);
    if (res == NULL)
        return -1;
    if (PyBool_Check(res))
        ok = (res == Py_True);
    else
        ok = PyObject_IsTrue(res);
    Py_DECREF(res);
    return ok;
}
The object identity test (if (v == w)) is indeed done before the normal Python comparison PyObject_RichCompare(v, w, op) is attempted, and this is mentioned in the documentation:
Note: If o1 and o2 are the same object, PyObject_RichCompareBool() will always return 1 for Py_EQ and 0 for Py_NE.
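That guarantee is easy to demonstrate from pure Python (an illustration of my own, not from the docs): even a class whose __eq__ always returns False is "equal" to itself wherever PyObject_RichCompareBool is used, for example in list membership tests:
>>> class NeverEqual:
...     def __eq__(self, other):
...         return False
...
>>> x = NeverEqual()
>>> x == x            # PyObject_RichCompare calls __eq__
False
>>> x in [x]          # PyObject_RichCompareBool: identity implies equality
True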
I am not sure whether this is the reason, but I just noticed this about the nans in lst and arr:
>>> lst[0] == lst[1], arr[0] == arr[1]
(False, False)
>>> lst[0] is lst[1], arr[0] is arr[1]
(True, False)
I.e., while all nans are unequal, the regular np.nan values (of type float) are all the same instance, while the nans in arr are different instances (of type numpy.float64). So my guess would be that if no key function is given, groupby will test for identity before doing the more expensive equality check.
This is also consistent with the observation that it does not group in arr.tolist() either, because even though those nans are now float again, they are no longer the same instance.
>>> atl = arr.tolist()
>>> atl[0] is atl[1]
False