Bool and missing values in pandas

I am trying to figure out whether a column in a pandas DataFrame is boolean (and, if so, whether it has missing values and so on).

To test the function I wrote, I tried to create a DataFrame with a boolean column that contains missing values. However, it seems that missing values are only handled in an 'untyped' way in Python, and there is some weird behaviour:

> import numpy as np
> import pandas as pd
> boolean = pd.Series([True, False, None])
> print(boolean)

0     True
1    False
2     None
dtype: object

So the moment you put None into the list, the Series gets dtype object, because pandas cannot store values of type bool and type(None)=NoneType together in a single bool column. The same thing happens with math.nan and numpy.nan. The weirdest things happen when you try to force pandas into an area it does not want to go :-)

> boolean = pd.Series([True, False, np.nan]).astype(bool)
> print(boolean)
0     True
1    False
2     True
dtype: bool

So 'np.nan' is being cast to 'True'?

Questions:

  1. Given a data table where one column has dtype 'object' but is in fact a boolean column with missing values: how do I figure that out? Even after filtering out the missing values it is still of dtype 'object'. Do I need to implement a try/except cast of every column into every imaginable data type in order to see the true nature of the columns?

  2. I guess there is a logical explanation for why np.nan is cast to True, but this is unwanted behaviour of pandas/Python itself, right? So should I file a bug report?

asked Aug 28 '19 13:08 by Fabian Werner



1 Answer

Q1: I would start by combining

np.any(pd.isna(boolean))

to check whether the column contains any missing values, with

set(boolean)

to check whether it contains only True, False and None. Combined with filtering (and, if you prefer, typecasting), that should do it.
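
The two checks above could be combined roughly as follows. This is only a sketch: the function name is_boolean_with_missing is made up for illustration, and it uses isinstance instead of set() because 1 == True in Python, so a set comparison alone would also accept columns of 0/1 integers.

```python
from typing import Tuple

import numpy as np
import pandas as pd


def is_boolean_with_missing(column: pd.Series) -> Tuple[bool, bool]:
    """Return (looks_boolean, has_missing) for a column.

    Hypothetical helper, not part of pandas.
    """
    # True if the column holds at least one None/NaN
    has_missing = bool(column.isna().any())

    # Look only at the values that are actually present
    non_missing = column.dropna()

    # An all-missing column gives no evidence of being boolean
    looks_boolean = len(non_missing) > 0 and all(
        isinstance(value, (bool, np.bool_)) for value in non_missing
    )
    return looks_boolean, has_missing


boolean = pd.Series([True, False, None])
print(is_boolean_with_missing(boolean))
```

Applied to every column of a DataFrame, this avoids the try/except cast into every imaginable type, at least for the boolean case.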

Q2: see comment of @WeNYoBen
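
The comment referred to is not reproduced on this page. Independently of it, one thing that can be checked directly is that NaN is a float that is not equal to zero, and Python treats every non-zero float as truthy, so the cast to True follows Python's ordinary rules. The sketch below shows that, plus the nullable "boolean" dtype, assuming pandas 1.0 or later (which is newer than the question).

```python
import math

import numpy as np
import pandas as pd

# NaN is a non-zero float, so bool() maps it to True
nan_is_truthy = bool(np.nan)
math_nan_is_truthy = bool(math.nan)
print(nan_is_truthy, math_nan_is_truthy)

# Assumes pandas >= 1.0: the nullable "boolean" dtype keeps the
# missing value as <NA> instead of turning it into True
raw = pd.Series([True, False, np.nan])
nullable = raw.astype("boolean")
print(nullable)
```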

answered Oct 31 '22 18:10 by Sosel