 

Why doesn't "is not None" work with dataframe.loc, but "!= None" works fine?

Tags:

python

pandas

I'm currently playing with a Pandas DataFrame, and I wanted to select all rows in the DataFrame whose 'entities' attribute is not None.

df_ = df.loc[df['entities'] != None]

seems to work pretty well, but

df_ = df.loc[df['entities'] is not None]

raises a KeyError:

File "pandas\_libs\index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 128, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: True

I already know the solution to my original problem; I'm just curious why this happens.

asked Aug 27 '19 by RoloSong


2 Answers

Instead of

df_ = df.loc[df['entities'] is not None]

or

df_ = df.loc[df['entities'] != None]

you should rather use

df_ = df.loc[df['entities'].notna()]

The representation of missing values in pandas differs from the usual Python way of representing them by None.

In particular, you get the KeyError because the column Series df['entities'] is checked for identity with None. This evaluates to True in any case, because the Series itself is not None. .loc then searches the row index for the label True, which is not present in your case, so it raises the exception.

!= doesn't raise this exception, because the inequality operator is overloaded by pandas.Series (otherwise you couldn't build indexers by comparing a column with a fixed value, as in df['name'] == 'Miller'). This overloaded method performs an elementwise comparison and itself returns an indexer that works fine with .loc. The result just might not be what you intended.
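To see the two halves of the failure in isolation, here is a small sketch (the column name 'entities' is reused from the question; the data itself is made up):

```python
import pandas as pd

df = pd.DataFrame({'entities': ['a', None, 'b']})

# `is not None` compares the Series object itself with None.
# An existing column is never None, so this is the plain bool True:
cond = df['entities'] is not None
print(cond)  # True

# .loc then tries to look up the scalar True as a row label,
# which does not exist in the index:
try:
    df.loc[cond]
except KeyError as exc:
    print('KeyError:', exc)
```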

E.g. if you do

import pandas as pd
import numpy as np

df = pd.DataFrame(dict(x=[1, 2, 3], y=list('abc'), nulls=[None, np.nan, np.float32('inf')]))

df['nulls'].isna()

It returns:

Out[18]: 
0     True
1     True
2    False
Name: nulls, dtype: bool

but the code:

df['nulls'] == None

returns

Out[20]: 
0    False
1    False
2    False
Name: nulls, dtype: bool
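Since NaN also compares unequal to None, the all-False result above flips into an all-True mask under !=, so the "filter" silently keeps the missing rows. A short sketch of that consequence, reusing the same nulls column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(x=[1, 2, 3], nulls=[None, np.nan, np.float32('inf')]))

# NaN != None evaluates to True, so the mask is True for every row:
mask = df['nulls'] != None  # noqa: E711 - deliberately the "wrong" way
print(mask.tolist())        # [True, True, True]

# Nothing gets filtered out, including the missing entries:
print(len(df.loc[mask]))    # 3
```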

If you look at the datatype of the objects stored in the column, you see they are all floats:

df['nulls'].map(type)
Out[19]: 
0    <class 'float'>
1    <class 'float'>
2    <class 'float'>
Name: nulls, dtype: object

For columns of other dtypes, the representation of missing values might differ again. E.g. if you use an Int64 column, it looks like this:

df['nulls_int64'] = pd.Series([None, 1, 2], dtype='Int64')
df['nulls_int64'].map(type)
Out[26]: 
0    <class 'float'>
1      <class 'int'>
2      <class 'int'>
Name: nulls_int64, dtype: object

So using notna() instead of != None also keeps your code free from having to deal with pandas-internal representations of missing data.
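A quick sketch of how notna() handles all of these representations uniformly (the series here are made up for illustration):

```python
import numpy as np
import pandas as pd

s_obj = pd.Series(['a', None, 'b'])             # object dtype, genuine None
s_flt = pd.Series([1.0, np.nan, 2.0])           # float dtype, NaN
s_int = pd.Series([1, None, 2], dtype='Int64')  # nullable Int64, pd.NA

# notna() recognises every internal missing-value representation:
print(s_obj.notna().tolist())  # [True, False, True]
print(s_flt.notna().tolist())  # [True, False, True]
print(s_int.notna().tolist())  # [True, False, True]
```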

answered Oct 24 '22 by jottbe


I'm somewhat going out on a limb here, since I don't have experience with Pandas, but I do with Python…

Pandas' magic filtering through [] relies heavily on operator overloading. In this expression:

df.loc[df['entities'] != None]

df['entities'] is an object which implements the __ne__ method, which means you're essentially doing:

df.loc[df['entities'].__ne__(None)]

df['entities'].__ne__(None) produces some new magic condition object. The df.loc object implements the __getitem__ method to overload the [] subscript syntax, so the whole thing is essentially:

df.loc.__getitem__(df['entities'].__ne__(None))

On the other hand, the is operator is not overloadable. There's no __is__ method an object could implement, so df['entities'] is not None is evaluated just as is by Python's core rules, and since df['entities'] probably really is not None, the result of that expression is True. So just:

df.loc.__getitem__(True)

And that's why the lookup fails with KeyError: True.
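The overloading story can be sketched with a toy class (this Column class is purely illustrative, not the real pandas implementation):

```python
class Column:
    """Toy stand-in for a pandas Series - illustration only."""

    def __init__(self, values):
        self.values = values

    def __ne__(self, other):
        # Overloaded != returns an elementwise mask, like pandas does.
        return [v != other for v in self.values]

col = Column(['a', None, 'b'])

print(col != None)      # [True, False, True] - dispatches to __ne__
print(col is not None)  # True - identity check, cannot be overloaded
```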

answered Oct 24 '22 by deceze