Pandas aggregation ignoring NaN's

I am aggregating my pandas DataFrame, data. Specifically, I want the average and the sum of amount for each (origin, type) pair. For averaging and summing I tried the functions below:

import numpy as np
import pandas as pd
result = data.groupby(groupbyvars).agg({'amount': [ pd.Series.sum, pd.Series.mean]}).reset_index() 

My issue is that the amount column includes NaNs, which causes many of the averages and sums in the result to be NaN.

I know both pd.Series.sum and pd.Series.mean have skipna=True by default, so why am I still getting NaNs here?

I also tried this, which obviously did not work:

data.groupby(groupbyvars).agg({'amount': [ pd.Series.sum(skipna=True), pd.Series.mean(skipna=True)]}).reset_index() 

EDIT: Upon @Korem's suggestion, I also tried to use a partial as below:

s_na_mean = partial(pd.Series.mean, skipna = True)    
data.groupby(groupbyvars).agg({'amount': [ np.nansum, s_na_mean ]}).reset_index() 

but get this error:

error: 'functools.partial' object has no attribute '__name__'
Zhubarb asked Oct 01 '14 16:10


2 Answers

Use numpy's nansum and nanmean:

from numpy import nansum
from numpy import nanmean
data.groupby(groupbyvars).agg({'amount': [ nansum, nanmean]}).reset_index() 
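To try this on something concrete, here is a minimal sketch on a toy frame. The sample rows and the contents of groupbyvars are made up for illustration, and the sketch swaps in pandas' own 'sum' and 'mean' string aliases, which also skip NaN within each group, in place of the NumPy functions above:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: group (A, x) contains one NaN amount.
data = pd.DataFrame({
    "origin": ["A", "A", "B", "B"],
    "type": ["x", "x", "y", "y"],
    "amount": [1.0, np.nan, 3.0, 5.0],
})
groupbyvars = ["origin", "type"]

# The string aliases dispatch to the groupby sum/mean, which ignore NaN.
result = data.groupby(groupbyvars).agg({"amount": ["sum", "mean"]}).reset_index()
print(result)
```

In this toy frame the (A, x) group has a single non-null amount, so both its sum and its mean come from that one value.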

As a workaround for older versions of NumPy, and also a way to fix your last try:

When you write pd.Series.sum(skipna=True), you actually call the method on the spot instead of handing a function to agg. To fix an argument in advance, define a partial. So if you don't have nanmean, define s_na_mean and use that:

from functools import partial
s_na_mean = partial(pd.Series.mean, skipna = True)
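The '__name__' error reported in the question presumably arises because agg looks up a name for each function in the list, and the partial object does not carry one in that pandas version. One way around it, sketched here with made-up sample data and the hypothetical helper names s_na_sum and s_na_mean, is to wrap the calls in ordinary named functions:

```python
import numpy as np
import pandas as pd

def s_na_sum(s):
    # Sum of one group's values, skipping NaN.
    return s.sum(skipna=True)

def s_na_mean(s):
    # Mean of one group's values, skipping NaN.
    return s.mean(skipna=True)

# Hypothetical sample data, for illustration only.
data = pd.DataFrame({
    "origin": ["A", "A", "B", "B"],
    "type": ["x", "x", "y", "y"],
    "amount": [1.0, np.nan, 3.0, 5.0],
})
groupbyvars = ["origin", "type"]

# Plain functions have a __name__, which becomes the column label.
result = data.groupby(groupbyvars).agg({"amount": [s_na_sum, s_na_mean]}).reset_index()
print(result)
```

The resulting columns are labelled after the function names, so renaming the helpers changes the output headers.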
Korem answered Sep 22 '22 12:09


It might be too late, but this may be useful for others.

Try the apply function:

import numpy as np
import pandas as pd

def nan_agg(x):
    # Keep only the non-null amounts of this group.
    # (Use notnull() or ~isnull(); `not` on a Series raises a ValueError.)
    amount = x.loc[x['amount'].notnull(), 'amount']

    res = {}
    res['nansum'] = amount.sum()
    res['nanmean'] = amount.mean()

    return pd.Series(res, index=['nansum', 'nanmean'])

result = data.groupby(groupbyvars).apply(nan_agg).reset_index() 
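If groups whose amounts are all NaN do not need to appear in the output, a shorter variant is to drop the NaN rows before grouping. This is only a sketch on made-up sample data, not the method above, and it assumes that losing the all-NaN groups is acceptable:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: group (C, z) has no non-null amount at all.
data = pd.DataFrame({
    "origin": ["A", "A", "B", "B", "C"],
    "type": ["x", "x", "y", "y", "z"],
    "amount": [1.0, np.nan, 3.0, 5.0, np.nan],
})
groupbyvars = ["origin", "type"]

result = (
    data.dropna(subset=["amount"])   # remove rows with a missing amount
        .groupby(groupbyvars)["amount"]
        .agg(["sum", "mean"])
        .reset_index()
)
print(result)
```

Note that the (C, z) group disappears from the result entirely, because none of its rows survive the dropna step.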
Miros answered Sep 22 '22 12:09