I'm working with pandas for the first time and I'm having some issues with aggregation. I have a dataframe with three calculated fields added by an apply statement like this:
dataset['calculated_field'] = dataset.apply(
    lambda row: calculation_function(
        row['field1'],
        row['field2']
    ),
    axis=1
)
The calculated fields are booleans, but with a catch: they can contain null values.
I am trying to find the average of the boolean columns, with each average ignoring the null fields for that column.
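For example, a column holding [True, True, False, None] should average to 2/3 ≈ 0.667 (two Trues out of three non-null values), not 2/4.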
I've tried something like this:
resultset = dataset.groupby(['grouping_field'])[['calculated_field', 'calculated_field_2', 'calculated_field_3']].mean()
The problem is that since the True/False/None column has the "object" dtype, pandas drops the columns entirely as "nuisance" columns.
I can't convert the column to bool, though, because that turns the null values into False.
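To make that concrete, here's a minimal sketch (the three-element series is just an illustration, not my real data):

import pandas as pd

s = pd.Series([True, False, None])  # object dtype, like the output of apply
s.astype('bool').tolist()           # [True, False, False] -- the None silently became False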
I also tried the long route and created three separate dataframes, one per aggregate, so I could drop the null values and convert to bool (which can be aggregated):
dataset_for_field1 = dataset.dropna(subset=['calculated_field']).copy()
dataset_for_field1['calculated_field'] = dataset_for_field1['calculated_field'].astype('bool')
result_for_field1 = dataset_for_field1.groupby(['grouping_field'])['calculated_field'].mean()
This gives me the data I'm looking for, but in three separate dataframes.
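I know I could stitch the three results back together with something like pd.concat (result_for_field2 and result_for_field3 here stand for the analogous results for the other two columns, built the same way):

resultset = pd.concat([result_for_field1, result_for_field2, result_for_field3], axis=1)

But that whole detour feels like more work than it should be.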
Is there a way to get one dataframe with the average of each column, ignoring nulls?
Convert them to numeric columns. The None values will become NaN, Trues become 1, and Falses become 0. A convenient way to convert the whole dataframe is to use pd.to_numeric with the errors parameter set to 'ignore'. This leaves the grouping column alone, because the conversion errors out on it and moves on.
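As a quick single-column check of what that conversion does:

pd.to_numeric(pd.Series([True, False, None]))

0    1.0
1    0.0
2    NaN
dtype: float64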
Consider the dataframe df:

import pandas as pd

df = pd.DataFrame(dict(
    gcol=list('aaaabbbb'),
    clc1=[True, False, True, None] * 2,
    clc2=[True, False, True, False] * 2,
    clc3=[True, True, True, True] * 2,
    clc4=[False, None, None, True] * 2
))
This is what converting to numeric looks like:
df.apply(pd.to_numeric, errors='ignore')
clc1 clc2 clc3 clc4 gcol
0 1.0 True True 0.0 a
1 0.0 False True NaN a
2 1.0 True True NaN a
3 NaN False True 1.0 a
4 1.0 True True 0.0 b
5 0.0 False True NaN b
6 1.0 True True NaN b
7 NaN False True 1.0 b
Using this with the subsequent groupby should get you what you want; mean() skips NaN by default, so the nulls in each column are ignored.
df.apply(pd.to_numeric, errors='ignore').groupby('gcol').mean()
clc1 clc2 clc3 clc4
gcol
a 0.666667 0.5 1.0 0.5
b 0.666667 0.5 1.0 0.5
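For example, in group a the clc1 values are 1.0, 0.0, 1.0, and NaN, and (1.0 + 0.0 + 1.0) / 3 ≈ 0.666667; the NaN row is excluded automatically.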