I want to use unique in a groupby aggregation, but I don't want NaN in the unique result.
An example dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 1, 1, np.nan, 3, 3], 'b': [0, 0, 1, 1, 1, 1, 1],
    'c': ['foo', np.nan, 'bar', 'foo', 'baz', 'foo', 'bar']})
       a  b    c
0 1.0000  0  foo
1 2.0000  0  NaN
2 1.0000  1  bar
3 1.0000  1  foo
4    nan  1  baz
5 3.0000  1  foo
6 3.0000  1  bar
And the groupby:
df.groupby('b').agg({'a': ['min', 'max', 'unique'], 'c': ['first', 'last', 'unique']})
Its result is:
       a                             c                      
     min    max           unique first last           unique
b                                                           
0 1.0000 2.0000       [1.0, 2.0]   foo  foo       [foo, nan]
1 1.0000 3.0000  [1.0, nan, 3.0]   bar  bar  [bar, foo, baz]
But I want it without nan:
       a                        c                      
     min    max      unique first last           unique
b                                                           
0 1.0000 2.0000  [1.0, 2.0]   foo  foo            [foo]
1 1.0000 3.0000  [1.0, 3.0]   bar  bar  [bar, foo, baz]
How can I do that? Of course I have several columns to aggregate and every column needs different aggregation functions, so I don't want to do the unique aggregations one-by-one and separately from other aggregations.
Use the dropna() function to drop NaN / None values from a pandas Series or DataFrame.
To ignore NaN values while returning unique values, chain dropna() and unique().
Define a function:
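def unique_non_null(s):
    # Drop NaN/None, then return the distinct values as a plain list
    # (some pandas versions can reject a NumPy array returned from agg).
    return s.dropna().unique().tolist()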
Then use it in the aggregation:
df.groupby('b').agg({
    'a': ['min', 'max', unique_non_null], 
    'c': ['first', 'last', unique_non_null]
})
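For reference, here is a self-contained sketch of the approach above, run against the example frame from the question. One deviation is an assumption on my part rather than something stated in the thread: the function returns a plain list via .tolist(), because some pandas versions can raise "Must produce aggregated value" when a custom aggregation returns a NumPy array for a numeric column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 1, 1, np.nan, 3, 3],
    'b': [0, 0, 1, 1, 1, 1, 1],
    'c': ['foo', np.nan, 'bar', 'foo', 'baz', 'foo', 'bar'],
})

def unique_non_null(s):
    # Drop NaN/None first, then collect the distinct values that remain.
    # Returning a list (not an ndarray) is an assumption made for
    # compatibility across pandas versions.
    return s.dropna().unique().tolist()

result = df.groupby('b').agg({
    'a': ['min', 'max', unique_non_null],
    'c': ['first', 'last', unique_non_null],
})

# The function's name becomes the second-level column label.
print(result[('a', 'unique_non_null')].tolist())  # [[1.0, 2.0], [1.0, 3.0]]
print(result[('c', 'unique_non_null')].tolist())  # [['foo'], ['bar', 'foo', 'baz']]
```

Because the function is passed alongside 'min', 'max', 'first' and 'last' in the same agg call, each column keeps its own mix of aggregations.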
This will work for what you need:
df.fillna(method='ffill').groupby('b').agg({'a': ['min', 'max', 'unique'], 'c': ['first', 'last', 'unique']})
Because you use min, max and unique, repeated values do not concern you.
This answer is terrible, don't use it. Please refer to @IanS's answer.
Try ffill:
df.ffill().groupby('b').agg({'a': ['min', 'max', 'unique'], 'c': ['first', 'last', 'unique']})
      c                          a                 
  first last           unique  min  max      unique
b                                                  
0   foo  foo            [foo]  1.0  2.0  [1.0, 2.0]
1   bar  bar  [bar, foo, baz]  1.0  3.0  [1.0, 3.0]
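If NaN is the first element of a group, the above solution breaks: ffill runs on the whole frame before grouping, so the NaN is filled with the value from the preceding row, which may belong to a different group (and a NaN in the very first row is not filled at all).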
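A small sketch of that failure mode, using a hypothetical three-row frame (not the one from the question) in which group 1 starts with a NaN. The drop-based helper is written here for comparison and returns a list, which is my own choice rather than anything from the answers above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the first row of group b == 1 has NaN in 'a'.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [0, 1, 1]})

# Forward-fill before grouping: the NaN takes the value 1.0 from group 0.
filled = df.ffill().groupby('b')['a'].unique()

def unique_non_null(s):
    # Drop NaN inside each group instead of filling across groups.
    return s.dropna().unique().tolist()

dropped = df.groupby('b')['a'].agg(unique_non_null)

print(filled.loc[1].tolist())  # [1.0, 3.0]  (1.0 leaked in from group 0)
print(dropped.loc[1])          # [3.0]
```

Dropping NaN inside each group avoids the problem because no value ever crosses a group boundary.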