Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to aggregate a boolean field with null values with pandas?

I'm working with pandas for the first time and I'm having some issues with aggregation. I have a dataframe with three calculated fields added by an apply statement like this:

dataset['calculated_field'] = dataset.apply(
    lambda row: calculation_function(
        row['field1'],
        row['field2']
    ),
    axis = 1
)

The calculated fields are booleans, but with a catch. They can contain null values.

I am trying to find the average of the boolean columns, with each average ignoring the null fields for that column.

I've tried something like this:

resultset = dataset.groupby(['grouping_field'])[['calculated_field','calculated_field_2','calculated_field_3']].mean()

The problem is that since the True/False/None boolean is an "object" type, pandas drops the columns entirely as a “nuisance” column..

I can't convert the column to a bool, though, because it makes the null values "False"

I also tried the long route and created 3 seperate dataframes for each aggregate, so I could drop the null values and convert to bool (which can be aggregated):

dataset_for_field1 = dataset.dropna(subset = ['calculated_field']).copy()
dataset_for_field1['calculated_field'] = dataset_for_field1['calculated_field'].astype('bool')

result_for_field1 = dataset_for_field1.groupby(['grouping_field'])['calculated_field'].mean()

This gives me the data I'm looking for, but in three separate dataframes.

Is there a way to get one dataframe with the average of each column, ignoring nulls?

like image 947
jwaltzjr Avatar asked Mar 24 '17 20:03

jwaltzjr


People also ask

Is null and Notnull in pandas?

Python | Pandas isnull() and notnull() While making a Data Frame from a csv file, many blank columns are imported as null value into the Data Frame which later creates problems while operating that data frame. Pandas isnull() and notnull() methods are used to check and manage NULL values in a data frame.


1 Answers

Convert them to numeric columns. The None will become NaN, Trues become 1, and Falses become 0. A convenient way to convert the whole dataframe is to use pd.to_numeric with the errors parameter set to ignore. This will leave the grouping column alone because it will error out on move on.

Consider the dataframe df

df = pd.DataFrame(dict(
        gcol=list('aaaabbbb'),
        clc1=[True, False, True, None] * 2,
        clc2=[True, False, True, False] * 2,
        clc3=[True, True, True, True] * 2,
        clc4=[False, None, None, True]* 2
    ))

This is what converting to numeric looks like

df.apply(pd.to_numeric, errors='ignore')

   clc1   clc2  clc3  clc4 gcol
0   1.0   True  True   0.0    a
1   0.0  False  True   NaN    a
2   1.0   True  True   NaN    a
3   NaN  False  True   1.0    a
4   1.0   True  True   0.0    b
5   0.0  False  True   NaN    b
6   1.0   True  True   NaN    b
7   NaN  False  True   1.0    b

Using this with the subsequent groupby should get you what you want.

df.apply(pd.to_numeric, errors='ignore').groupby('gcol').mean()

          clc1  clc2  clc3  clc4
gcol                            
a     0.666667   0.5   1.0   0.5
b     0.666667   0.5   1.0   0.5
like image 192
piRSquared Avatar answered Oct 16 '22 15:10

piRSquared