 

Why does pandas isnull() work but ==None not work?

I am trying to select the rows of df where the column label has the value None. (The value really is None, which I obtained from another function, not NaN.)

Why does df[df['label'].isnull()] return the rows I wanted,

but df[df['label'] == None] returns Empty DataFrame Columns: [path, fanId, label, gain, order] Index: [] ?

raffa asked Mar 03 '23 08:03



1 Answer

As the comment above states, pandas represents missing data as NaN, which is a numeric value of float type. None, by contrast, is Python's NoneType singleton, so NaN is not equivalent to None.

In [27]: np.nan == None
Out[27]: False

This GitHub thread discusses the behavior further, noting:

This was done quite a while ago to make the behavior of nulls consistent, in that they don't compare equal. This puts None and np.nan on an equal (though not-consistent with python, BUT consistent with numpy) footing.

This means that when you write df[df['label'] == None], pandas compares element by element and treats None as a null, so each check behaves like np.nan == np.nan, which we know is False.

In [63]: np.nan == np.nan
Out[63]: False
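
To make the difference concrete, here is a minimal sketch of the two masks side by side. The frame is a hypothetical stand-in for the asker's df (the path values are made up), and dtype=object is passed explicitly so the None entries are stored as actual None objects.

```python
import pandas as pd

# Hypothetical stand-in for the asker's df; the path values are invented.
# dtype=object keeps the None entries as real None objects.
df = pd.DataFrame({
    "path": ["a.wav", "b.wav", "c.wav"],
    "label": pd.Series(["x", None, None], dtype=object),
})

# Deliberately written with == to show the behaviour from the question.
eq_mask = df["label"] == None  # noqa: E711
null_mask = df["label"].isnull()

print(eq_mask.tolist())    # [False, False, False]
print(null_mask.tolist())  # [False, True, True]

# Boolean indexing with the isnull() mask selects the rows with a missing label.
print(df[null_mask])
```

The == mask is all False, which is why indexing with it gives an empty DataFrame, while the isnull() mask picks out the two rows with a missing label.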

Additionally, you should not write df[df['label'] == None] for Boolean indexing. Using == with None is not best practice, as PEP 8 mentions:

Comparisons to singletons like None should always be done with is or is not, never the equality operators.

For example, you could do tst.value.apply(lambda x: x is None), which yields the same outcome as .isnull(), illustrating how pandas treats these values as nulls. Note that this applies to the tst DataFrame example below, where tst.value has dtype object and I have explicitly set some elements to None.
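
As a rough sketch of that equivalence, the snippet below runs the identity check and .isnull() on a standalone object column. The Series name value is only illustrative, and dtype=object is an assumption made so that None is stored as an actual None rather than converted.

```python
import pandas as pd

# Illustrative object-dtype column; dtype=object keeps None as a real None.
value = pd.Series(["A", "B", None, "C", None], dtype=object)

# Identity check, as PEP 8 recommends for singletons like None.
by_identity = value.apply(lambda x: x is None)

# pandas' own null check.
by_isnull = value.isnull()

print(by_identity.tolist())  # [False, False, True, False, True]
print(by_isnull.tolist())    # [False, False, True, False, True]
```

In practice .isnull() is the more general choice, since the identity check only finds entries that are literally None.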

There is a nice example in the pandas docs that illustrates this and its effect.

For example, if you have two columns, one of type float and the other of type object, you can see how pandas handles None gracefully. Notice that the float column stores NaN instead.

In [32]: tst = pd.DataFrame({"label" : [1, 2, None, 3, None], "value" : ["A", "B", None, "C", None]})

In [39]: tst
Out[39]:
   label value
0    1.0     A
1    2.0     B
2    NaN  None
3    3.0     C
4    NaN  None

In [51]: type(tst.value[2])
Out[51]: NoneType

In [52]: type(tst.label[2])
Out[52]: numpy.float64
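
The following sketch rebuilds the same tst frame and shows one consequence of the conversion: in the float column an identity check against None finds nothing, while isnull() flags the missing entries in both columns. Passing dtype=object for the value column is an assumption added here so the None entries are kept as-is.

```python
import pandas as pd

tst = pd.DataFrame({
    "label": [1, 2, None, 3, None],
    # dtype=object is an assumption here, so None is stored unchanged.
    "value": pd.Series(["A", "B", None, "C", None], dtype=object),
})

# The None entries in the numeric column were converted to NaN on construction.
print(tst["label"].dtype)  # float64

# Because they are NaN now, an identity check against None matches nothing.
label_is_none = tst["label"].apply(lambda x: x is None)
print(label_is_none.any())  # False

# isnull() flags the missing entries regardless of how they are stored.
null_rows = tst.isnull()
print(null_rows["label"].tolist())  # [False, False, True, False, True]
print(null_rows["value"].tolist())  # [False, False, True, False, True]
```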

This post explains the difference between NaN and None really well; I would definitely take a look at it.

  • What is the difference between NaN and None?
RK1 answered Mar 11 '23 05:03
