I have a dataframe where each unique group has 4 rows, so I need to group by the columns that make the groups unique and run some aggregations such as max, min, sum and average. The problem is that for some groups a column contains all NaN values, and the aggregation returns 0. Is it possible to return NaN instead? For example: df
time id el conn column1 column2 column3
2018-02-11 14:00:00 1 a 12 8 5 NaN
2018-02-11 14:00:00 1 a 12 1 NaN NaN
2018-02-11 14:00:00 1 a 12 3 7 NaN
2018-02-11 14:00:00 1 a 12 4 12 NaN
2018-02-11 14:00:00 2 a 5 NaN 5 5
2018-02-11 14:00:00 2 a 5 NaN 3 2
2018-02-11 14:00:00 2 a 5 NaN NaN 6
2018-02-11 14:00:00 2 a 5 NaN 7 NaN
So, for example, I need to group by ('id', 'el', 'conn') and find the sum of column1, column2 and column3. (In the real case there are many more columns that need aggregation.) I have tried a few ways, .sum() and .transform('sum'), but both return zero for a group with all NaN values.
Desired output:
time id el conn column1 column2 column3
2018-02-11 14:00:00 1 a 12 16 24 NaN
2018-02-11 14:00:00 2 a 5 NaN 15 13
Any help is welcomed.
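For reference, the unwanted behavior can be reproduced with a minimal frame (a hypothetical two-column version of the data above, keeping only `id` and `column1`):

```python
import numpy as np
import pandas as pd

# Group 2 has only NaN in column1, mirroring the data in the question
df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'column1': [1.0, 3.0, np.nan, np.nan],
})

sums = df.groupby('id')['column1'].sum()
print(sums)
# The all-NaN group comes back as 0.0, not NaN
```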
One workaround is to use a placeholder (e.g. -1) before doing the groupby. That said, this feels like a pretty awful hack... perhaps there should be an option to include NaN in groupby (see this github issue, which uses the same placeholder hack).
However, the size (which includes NaNs) and the count (which ignores NaNs) of a group will differ if there are NaNs. When they differ, you can set the aggregation result for that group back to None.
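A minimal sketch of the size/count idea above, assuming the aggregation is a plain sum: compute the sum, then use count() (which ignores NaNs) to find groups with no valid values and restore NaN there.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'column1': [1.0, 3.0, np.nan, np.nan],
})

grouped = df.groupby('id')['column1']
summed = grouped.sum()                 # the all-NaN group sums to 0.0
summed[grouped.count() == 0] = np.nan  # count() == 0 means no valid values
print(summed)
```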
There is inconsistent behavior between pandas and numpy: NaN should not be treated as zero. This is also a problem for mean and std as well as sum. For example df_1 = pd.DataFrame({'col1': ('a', 'a', 'b', 'c'), 'col2': (np.NaN, 2, np.
Change the parameter min_count to 1 - this works since pandas version 0.22.0:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 1. This means the sum or product of an all-NA or empty series is NaN.
df = df.groupby(['time','id', 'el', 'conn'], as_index=False).sum(min_count=1)
print (df)
time id el conn column1 column2 column3
0 2018-02-11 14:00:00 1 a 12 16.0 24.0 NaN
1 2018-02-11 14:00:00 2 a 5 NaN 15.0 13.0
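If different columns need different aggregations (the question mentions max, min, sum and average), min_count can be threaded through per column with a lambda. This is a sketch using named aggregation, which assumes pandas >= 0.25; the output column names here are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'column1': [8.0, 1.0, np.nan, np.nan],
    'column2': [5.0, np.nan, 5.0, 3.0],
})

out = df.groupby('id', as_index=False).agg(
    column1_sum=('column1', lambda s: s.sum(min_count=1)),  # NaN if no valid values
    column1_max=('column1', 'max'),    # max already returns NaN for all-NaN groups
    column2_mean=('column2', 'mean'),  # mean also skips NaNs and returns NaN if none remain
)
print(out)
```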