I'm working with pandas for the first time and I'm having some issues with aggregation. I have a dataframe with three calculated fields added by an apply statement like this:
dataset['calculated_field'] = dataset.apply(
    lambda row: calculation_function(
        row['field1'],
        row['field2']
    ),
    axis=1
)
The calculated fields are booleans, but with a catch: they can contain null values.
I am trying to find the average of the boolean columns, with each average ignoring the null fields for that column.
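For example, a column holding [True, True, False, None] should average to 2/3 ≈ 0.667 (two Trues out of three non-null values), not 2/4.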
I've tried something like this:
resultset = dataset.groupby(['grouping_field'])[['calculated_field', 'calculated_field_2', 'calculated_field_3']].mean()
The problem is that since the True/False/None column has the "object" dtype, pandas drops the columns entirely as "nuisance" columns.
I can't convert the column to bool, though, because that turns the null values into False.
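To make that concrete, here's a minimal sketch (the three-element series is just an illustration, not my real data):

import pandas as pd

s = pd.Series([True, False, None])  # object dtype, like the output of apply
s.astype('bool').tolist()           # [True, False, False] -- the None silently became False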
I also tried the long route and created three separate dataframes, one per aggregate, so I could drop the null values and convert to bool (which can be aggregated):
dataset_for_field1 = dataset.dropna(subset=['calculated_field']).copy()
dataset_for_field1['calculated_field'] = dataset_for_field1['calculated_field'].astype('bool')
result_for_field1 = dataset_for_field1.groupby(['grouping_field'])['calculated_field'].mean()
This gives me the data I'm looking for, but in three separate dataframes.
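I know I could stitch the three results back together with something like pd.concat (result_for_field2 and result_for_field3 here stand for the analogous results for the other two columns, built the same way):

resultset = pd.concat([result_for_field1, result_for_field2, result_for_field3], axis=1)

But that whole detour feels like more work than it should be.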
Is there a way to get one dataframe with the average of each column, ignoring nulls?
Convert them to numeric columns. The None values will become NaN, Trues become 1, and Falses become 0. A convenient way to convert the whole dataframe is to use pd.to_numeric with the errors parameter set to 'ignore'. This leaves the grouping column alone, because the conversion errors out on it and moves on.
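As a quick single-column check of what that conversion does:

pd.to_numeric(pd.Series([True, False, None]))

0    1.0
1    0.0
2    NaN
dtype: float64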
Consider the dataframe df:

import pandas as pd

df = pd.DataFrame(dict(
    gcol=list('aaaabbbb'),
    clc1=[True, False, True, None] * 2,
    clc2=[True, False, True, False] * 2,
    clc3=[True, True, True, True] * 2,
    clc4=[False, None, None, True] * 2
))
This is what converting to numeric looks like:
df.apply(pd.to_numeric, errors='ignore')
clc1 clc2 clc3 clc4 gcol
0 1.0 True True 0.0 a
1 0.0 False True NaN a
2 1.0 True True NaN a
3 NaN False True 1.0 a
4 1.0 True True 0.0 b
5 0.0 False True NaN b
6 1.0 True True NaN b
7 NaN False True 1.0 b
Using this with the subsequent groupby should get you what you want; mean() skips NaN by default, so the nulls in each column are ignored.
df.apply(pd.to_numeric, errors='ignore').groupby('gcol').mean()
clc1 clc2 clc3 clc4
gcol
a 0.666667 0.5 1.0 0.5
b 0.666667 0.5 1.0 0.5
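For example, in group a the clc1 values are 1.0, 0.0, 1.0, and NaN, and (1.0 + 0.0 + 1.0) / 3 ≈ 0.666667; the NaN row is excluded automatically.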