I'm currently playing with a Pandas DataFrame, and I wanted to select all entries whose 'entities' attribute is not None.
df_ = df.loc[df['entities'] != None]
seems to work pretty well, but
df_ = df.loc[df['entities'] is not None]
will raise a KeyError:
File "pandas\_libs\index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 128, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: True
Well, I already know the solution to my original problem; I'm just curious why this happens.
Instead of
df_ = df.loc[df['entities'] is not None]
or
df_ = df.loc[df['entities'] != None]
you should rather use
df_ = df.loc[df['entities'].notna()]
This is because the representation of missing values in pandas differs from the usual Python way of representing them by None. In particular, you get the KeyError because the column series df['entities'] is checked for identity with None. This evaluates to True in any case, because the series itself is not None. Then .loc searches the row index for the label True, which is not present in your case, so it raises the exception. != doesn't cause this exception to be raised, because the equality operator is overloaded by pandas.Series (otherwise you couldn't build indexers by comparing a column with a fixed value, like in df['name'] == 'Miller'). This overloaded method performs an elementwise comparison and itself returns an indexer that works fine with the .loc method; just the result might not be what you intended.
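To see that difference concretely, here is a minimal sketch (the column name entities is assumed from the question) showing what each expression actually evaluates to:

import pandas as pd

df = pd.DataFrame({'entities': ['a', None, 'b']})

# Elementwise comparison: pandas.Series.__ne__ returns a boolean Series,
# which .loc can consume as a row mask.
mask = df['entities'] != None
print(type(mask))                  # <class 'pandas.core.series.Series'>

# Identity check: core Python evaluates this to a single bool,
# because the Series object itself is not None.
print(df['entities'] is not None)  # True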
As for the comparison result possibly not being what you intended: e.g. if you do
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(x=[1, 2, 3], y=list('abc'), nulls=[None, np.nan, np.float32('inf')]))
df['nulls'].isna()
It returns:
Out[18]:
0 True
1 True
2 False
Name: nulls, dtype: bool
but the code:
df['nulls'] == None
returns
Out[20]:
0 False
1 False
2 False
Name: nulls, dtype: bool
If you look at the datatype of the objects stored in the column, you see they are all floats:
df['nulls'].map(type)
Out[19]:
0 <class 'float'>
1 <class 'float'>
2 <class 'float'>
Name: nulls, dtype: object
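The reason the elementwise comparison returns False everywhere is that None was converted to np.nan when the float column was created, and NaN never compares equal to anything:

import numpy as np

np.nan == None    # False
np.nan == np.nan  # False -- NaN does not even compare equal to itself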
For columns of other types, the representation of missing values might even be different. E.g. if you use Int64 columns, it looks like this (output from an older pandas version; since pandas 1.0 the missing entry shows up as pd.NA instead of a float):
df['nulls_int64']= pd.Series([None, 1 , 2], dtype='Int64')
df['nulls_int64'].map(type)
Out[26]:
0 <class 'float'>
1 <class 'int'>
2 <class 'int'>
Name: nulls_int64, dtype: object
So using notna() instead of != None also helps you keep your code clean of pandas-internal representations of missing values.
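Applied to the original problem, a minimal sketch (column name taken from the question) looks like this:

import pandas as pd

df = pd.DataFrame({'entities': ['a', None, 'b']})
df_ = df.loc[df['entities'].notna()]  # keep only rows whose entities is not missing
print(df_)
#   entities
# 0        a
# 2        b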
I'm somewhat going out on a limb here, since I don't have experience with Pandas, but I do with Python…
Pandas' magic filtering with [] is based largely on operator overloading. In this expression:
df.loc[df['entities'] != None]
df['entities'] is an object which implements the __ne__ method, which means you're essentially doing:
df.loc[df['entities'].__ne__(None)]
The df['entities'].__ne__(None) call produces a new magic conditions object. The df.loc object implements the __getitem__ method to overload the [] subscript syntax, so the whole thing is essentially:
df.loc.__getitem__(df['entities'].__ne__(None))
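Here is a toy sketch of that pattern (deliberately not pandas internals, just hypothetical Column and Loc classes) showing how __ne__ and __getitem__ cooperate to build a filter:

class Column:
    def __init__(self, values):
        self.values = values

    def __ne__(self, other):
        # Elementwise comparison returning a boolean mask,
        # analogous to pandas.Series.__ne__
        return [v != other for v in self.values]

class Loc:
    def __init__(self, rows):
        self.rows = rows

    def __getitem__(self, mask):
        # The [] subscript dispatches here, analogous to df.loc.__getitem__
        return [row for row, keep in zip(self.rows, mask) if keep]

col = Column(['a', None, 'b'])
loc = Loc(['row0', 'row1', 'row2'])
print(loc[col != None])  # ['row0', 'row2'] -- __ne__ builds the mask, __getitem__ filters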
On the other hand, the is operator is not overloadable. There's no __is__ method an object could implement, so df['entities'] is not None is evaluated as-is by Python's core rules, and since df['entities'] probably really is not None, the result of that expression is True. So you end up with just:
df.loc.__getitem__(True)
And that's why the error message complains about KeyError: True.
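You can reproduce that final step directly; with the default integer index there is no row labelled True, hence the KeyError:

import pandas as pd

df = pd.DataFrame({'entities': ['a', None]})
df.loc[True]  # raises KeyError: True, just like df.loc[df['entities'] is not None]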