
Weird behavior checking if np.nan is in list from pandas DataFrame

It seems that checking if np.nan is in a list after pulling the list from a pandas dataframe does not correctly return True as expected. I have an example below to demonstrate:

from numpy import nan
import pandas as pd

basic_list = [0.0, nan, 1.0, 2.0]
nan_in_list = (nan in basic_list)
print(f"Is nan in {basic_list}? {nan_in_list}")

df = pd.DataFrame({'test_list': basic_list})
pandas_list = df['test_list'].to_list()
nan_in_pandas_list = (nan in pandas_list)
print(f"Is nan in {pandas_list}? {nan_in_pandas_list}")

I would expect the output of this program to be:

Is nan in [0.0, nan, 1.0, 2.0]? True
Is nan in [0.0, nan, 1.0, 2.0]? True

But instead it is:

Is nan in [0.0, nan, 1.0, 2.0]? True
Is nan in [0.0, nan, 1.0, 2.0]? False

What is the cause of this odd behavior or am I missing something?

Edit: Adding on to this, if I run the code:

for item in pandas_list:
    print(type(item))
    print(item)

it produces exactly the same output as it does when I swap pandas_list for basic_list. However, pandas_list == basic_list evaluates to False.

frenchytheasian asked Nov 29 '25 18:11


2 Answers

TL;DR

The list that pandas returns contains a different nan object than np.nan, and the in operator on a list checks object identity before it checks equality.


The in operator invokes the __contains__ magic method of list. Here is the CPython source code:

static int
list_contains(PyListObject *a, PyObject *el)
{
    PyObject *item;
    Py_ssize_t i;
    int cmp;

    for (i = 0, cmp = 0 ; cmp == 0 && i < Py_SIZE(a); ++i) {
        item = PyList_GET_ITEM(a, i);
        Py_INCREF(item);
        cmp = PyObject_RichCompareBool(item, el, Py_EQ);
        Py_DECREF(item);
    }
    return cmp;
}

As you can see, it calls PyObject_RichCompareBool, whose documentation states:

If o1 and o2 are the same object, PyObject_RichCompareBool() will always return 1 for Py_EQ and 0 for Py_NE.

So:

basic_list = [0.0, nan, 1.0, 2.0]
for v in basic_list:
    print(v == nan, v is nan)

print(nan in basic_list)

Prints:

False False
False True
False False
False False
True

And:

df = pd.DataFrame({"test_list": basic_list})
pandas_list = df["test_list"].to_list()

for v in pandas_list:
    print(v == nan, v is nan)

print(nan in pandas_list)

Prints:

False False
False False
False False
False False
False

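Evidently, the list that comes back from pandas holds a different nan object. The identity check fails for it, and because nan == nan is False the equality check fails as well, so in returns False.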
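If the goal is simply to know whether a list contains a NaN, a check that does not depend on object identity is safer. The following is a minimal sketch using only the standard library; contains_nan is a hypothetical helper name, and float("nan") stands in for both np.nan and the value that comes back from to_list():

```python
import math

nan = float("nan")                        # stand-in for numpy.nan (also a plain float)
same_object = [0.0, nan, 1.0]             # holds the very same nan object
other_object = [0.0, float("nan"), 1.0]   # holds a different nan object

def contains_nan(values):
    """Return True if any element is a float NaN, regardless of identity."""
    return any(isinstance(v, float) and math.isnan(v) for v in values)

# `in` depends on which nan object is stored in the list
print(nan in same_object, nan in other_object)                  # True False
# math.isnan looks at the value, so both lists are detected
print(contains_nan(same_object), contains_nan(other_object))    # True True
```

On the pandas side, df['test_list'].isna().any() answers the same question without converting to a list first.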

Andrej Kesely answered Dec 01 '25 06:12


For the built-in list type, in checks containment using is first (as an optimization) and only then falls back to ==. From the docs:

For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression x in y is equivalent to any(x is e or x == e for e in y).

(Note that this isn't how it is actually implemented. Dicts and sets, for example, use hash-based lookups to check containment, so they are O(1) in the average case instead of O(n).)

This is an optimization that the runtime uses because well-behaved types should always respect the logical implication "if x is y, then x == y". But this happens not to be true for the very strange value float('nan').
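The equivalence quoted from the docs can be written out directly. This is only a sketch of the documented semantics, not the real implementation, and contains_like_list is a hypothetical name:

```python
def contains_like_list(container, x):
    # Documented equivalent of `x in container`: identity first, then equality.
    return any(x is e or x == e for e in container)

nan = float("nan")
data = [1, nan, 3]

print(nan == nan)                                # False: nan never equals itself
print(contains_like_list(data, nan))             # True: found through `is`
print(contains_like_list(data, float("nan")))    # False: neither `is` nor `==` matches
```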

Since you are checking with the same object, the one that numpy plops into the main namespace for you (it is literally doing something like nan = float('nan')), the check is true when you construct a list using that object.

We can reproduce this behavior like this:

nan = float('nan')
data = [1, nan, 3]
print(nan in data) # True
print(float('nan') in data) # False
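The same identity shortcut would also explain why pandas_list == basic_list is False in the question: list equality compares elements pairwise and, in CPython, treats identical objects as equal before trying ==. Here is a small sketch without pandas, where fresh_objects plays the role of the list returned by to_list() and lists_equal_nan_aware is a hypothetical helper that assumes float elements:

```python
import math

nan = float("nan")
basic_list = [0.0, nan, 1.0, 2.0]
same_objects = list(basic_list)                  # shallow copy, same nan object
fresh_objects = [0.0, float("nan"), 1.0, 2.0]    # a different nan object

print(basic_list == same_objects)    # True: the nan elements are identical objects
print(basic_list == fresh_objects)   # False: different nan objects, and nan != nan

def lists_equal_nan_aware(a, b):
    """Compare two lists of floats, treating NaN as equal to NaN."""
    return len(a) == len(b) and all(
        x == y or (math.isnan(x) and math.isnan(y)) for x, y in zip(a, b)
    )

print(lists_equal_nan_aware(basic_list, fresh_objects))   # True
```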

juanpa.arrivillaga answered Dec 01 '25 07:12


