Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Group by and find sum for groups but return NaN as NaN, not 0

I have a dataframe where each unique group has 4 rows. So I need to group by columns that makes them unique and does some aggregations such as max, min, sum and average. But the problem is that I have for some group all NaN values (in some column) and returns me a 0. Is it possible to return me a NaN? For example: df

       time            id     el    conn   column1  column2  column3
2018-02-11 14:00:00     1     a      12      8        5         NaN
2018-02-11 14:00:00     1     a      12      1        NaN       NaN
2018-02-11 14:00:00     1     a      12      3        7         NaN
2018-02-11 14:00:00     1     a      12      4        12        NaN
2018-02-11 14:00:00     2     a      5       NaN      5         5
2018-02-11 14:00:00     2     a      5       NaN      3         2
2018-02-11 14:00:00     2     a      5       NaN      NaN       6
2018-02-11 14:00:00     2     a      5       NaN      7         NaN

So, for example, I need to groupby ('id', 'el', 'conn') and find sum for column1, column3 and column2. (In real case I have a lot more columns need to be performed aggregation on). I have tried a few ways: .sum(), .transform('sum'), but returns me a zero for group with all NaN values.

Desired output:

    time               id    el     conn   column1  column2  column3
2018-02-11 14:00:00     1     a      12      16       24       NaN
2018-02-11 14:00:00     2     a      5       NaN      15        13

Any help is welcomed.

like image 632
jovicbg Avatar asked Mar 12 '18 11:03

jovicbg


People also ask

Is there a way to include nan in groupby?

One workaround is to use a placeholder before doing the groupby (e.g. -1): That said, this feels pretty awful hack... perhaps there should be an option to include NaN in groupby (see this github issue - which uses the same placeholder hack).

Is it possible to set Nan count and size to none?

However, the size (includes NaNs) and the count (ignores NaNs) of a group will differ if there are NaNs. When these differ, you can set the value back to None for the result of the aggregation function for that group. Highly active question.

Is there a problem with Nan 0 in NumPy?

Sorry, something went wrong. There is inconsistent behavior between pandas and numpy. NaN should not be treated as zero. Also a problem for mean and std as well as sum. For example df_1 = pd. DataFrame ( { 'col1': ( 'a', 'a', 'b', 'c' ), 'col2': ( np. NaN, 2, np.

How does group by handle Nulls?

We can see that the first result value is a NULL represented by an empty string (the empty line before the IT department). This empty space represents all the NULL values returned by the GROUP BY clause, so we can conclude that GROUP BY treats NULLs as valid values.


1 Answers

Change parameter min_count to 1 - this working in last pandas version 0.22.0:

min_count : int, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

New in version 0.22.0: Added with the default being 1. This means the sum or product of an all-NA or empty series is NaN.

df = df.groupby(['time','id', 'el', 'conn'], as_index=False).sum(min_count=1)
print (df)
                  time  id el  conn  column1  column2  column3
0  2018-02-11 14:00:00   1  a    12     16.0     24.0      NaN
1  2018-02-11 14:00:00   2  a     5      NaN     15.0     13.0
like image 83
jezrael Avatar answered Sep 28 '22 06:09

jezrael