
pandas.eval with a boolean series with missing data

Description

I am using pandas.eval on a boolean series with missing data.

To do this I use an indexer to mark non-null values and .loc to only apply .eval on the rows with non-missing data.

Applying the logical not operator using the expression ~bool or not(bool) returns -1 or -2.

I understand that this happens because the missing values cause my boolean series to be stored with object dtype, but I am wondering:

  • Why is the output -1 and -2?
  • What would be the proper way to use .eval on a boolean series with missing data?

Example

Here is a reproducible example using pandas 0.20.3.

import pandas as pd

df = pd.DataFrame({'bool': [True, False, None]})
    bool
0   True
1  False
2   None

indexer = ~pd.isnull(df['bool'])
0     True
1     True
2    False
Name: bool, dtype: bool

df.loc[indexer].eval('~bool')
0    -2
1    -1
Name: bool, dtype: object
Alex asked Nov 15 '17 10:11



1 Answer

For eval, both ~ and not map to op.invert, as seen in the pandas source code:

_unary_ops_syms = '+', '-', '~', 'not'
_unary_ops_funcs = op.pos, op.neg, op.invert, op.invert
_unary_ops_dict = dict(zip(_unary_ops_syms, _unary_ops_funcs))

Thus when your Series is of good old object type, what you're seeing here is

>>> ~True
-2
>>> ~False
-1

# or with your Series
>>> ~pd.Series(True, dtype='object')
0    -2
dtype: object

whereas what you want is

>>> ~pd.Series(True)
0    False
dtype: bool

The outputs ~True -> -2 and ~False -> -1 are because bool is a subclass of int in Python, and -2, -1 are the bitwise complements of 1 and 0 respectively.
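The subclass relationship and the complement identity can be checked directly in plain Python. A small sketch, with no pandas involved; the variable names are illustrative only:

```python
# bool is a subclass of int, so True and False behave like 1 and 0.
bool_is_int = issubclass(bool, int)

# For any int x, the bitwise complement ~x equals -x - 1.
complements = {value: ~value for value in (True, False)}
identity_holds = all(~int(v) == -int(v) - 1 for v in (True, False))

# Logical negation stays in the bool domain, unlike ~.
logical = {value: (not value) for value in (True, False)}
```

Here complements maps True to -2 and False to -1, while logical maps them to False and True.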


The obvious solution is to either convert the Series to bool dtype beforehand with astype(bool) in an extra step, or, if for some reason you cannot do so before the eval, cast inside the expression:

>>> df.loc[indexer].eval('~bool.astype("bool")')
0    False
1     True
Name: bool, dtype: bool
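For the first option, converting in an extra step, a possible end-to-end sketch looks like this. It assumes the rows with missing values are filtered out before the cast (astype(bool) would otherwise turn None into False). The column name not_bool is hypothetical, and engine='python' is an assumption made here so the sketch does not depend on numexpr being installed.

```python
import pandas as pd

df = pd.DataFrame({'bool': [True, False, None]})

# Keep only the rows where a value is present.
indexer = df['bool'].notnull()
valid = df.loc[indexer].copy()

# Extra step: cast the object column to a real bool dtype.
valid['bool'] = valid['bool'].astype(bool)

# On a bool column, ~ is an element-wise logical negation.
negated = valid.eval('~bool', engine='python')

# Hypothetical column 'not_bool': index alignment leaves the
# row that had no value as missing.
df['not_bool'] = negated
```

After this, negated has bool dtype, and the row that was missing in the original column stays missing in not_bool.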
miradulo answered Nov 07 '22 08:11
