
Python pandas unique value ignoring NaN

I want to use unique in a groupby aggregation, but I don't want NaN in the unique result.

An example dataframe:

import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 1, 1, np.nan, 3, 3], 'b': [0, 0, 1, 1, 1, 1, 1],
                   'c': ['foo', np.nan, 'bar', 'foo', 'baz', 'foo', 'bar']})

       a  b    c
0 1.0000  0  foo
1 2.0000  0  NaN
2 1.0000  1  bar
3 1.0000  1  foo
4    nan  1  baz
5 3.0000  1  foo
6 3.0000  1  bar

And the groupby:

df.groupby('b').agg({'a': ['min', 'max', 'unique'], 'c': ['first', 'last', 'unique']})

Its result is:

       a                             c                      
     min    max           unique first last           unique
b                                                           
0 1.0000 2.0000       [1.0, 2.0]   foo  foo       [foo, nan]
1 1.0000 3.0000  [1.0, nan, 3.0]   bar  bar  [bar, foo, baz]

But I want it without NaN:

       a                        c                      
     min    max      unique first last           unique
b                                                           
0 1.0000 2.0000  [1.0, 2.0]   foo  foo            [foo]
1 1.0000 3.0000  [1.0, 3.0]   bar  bar  [bar, foo, baz]

How can I do that? I have several columns to aggregate, and each column needs different aggregation functions, so I don't want to run the unique aggregations one by one, separately from the other aggregations.

ragesz asked Sep 14 '17 12:09



3 Answers

Define a function:

def unique_non_null(s):
    return s.dropna().unique()

Then use it in the aggregation:

df.groupby('b').agg({
    'a': ['min', 'max', unique_non_null], 
    'c': ['first', 'last', unique_non_null]
})
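
For reference, here is a self-contained sketch of this approach run on the question's sample data. One deliberate variation, which is an assumption of this sketch and not part of the answer: the helper returns a plain list via .tolist(), because some newer pandas releases can reject a NumPy array returned from a custom aggregation on a numeric column with "ValueError: Must produce aggregated value".

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'a': [1, 2, 1, 1, np.nan, 3, 3],
    'b': [0, 0, 1, 1, 1, 1, 1],
    'c': ['foo', np.nan, 'bar', 'foo', 'baz', 'foo', 'bar'],
})


def unique_non_null(s):
    # Drop missing values first, then collect the distinct remaining values.
    # Assumption: .tolist() is added here so the result is a plain Python list.
    return s.dropna().unique().tolist()


result = df.groupby('b').agg({
    'a': ['min', 'max', unique_non_null],
    'c': ['first', 'last', unique_non_null],
})

# The custom column is named after the function.
print(result[('a', 'unique_non_null')].tolist())  # [[1.0, 2.0], [1.0, 3.0]]
print(result[('c', 'unique_non_null')].tolist())  # [['foo'], ['bar', 'foo', 'baz']]
```

Because the helper is an ordinary function, it can sit next to string aggregations such as 'min' or 'first' in the same agg dictionary, which is what the question asks for.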

IanS answered Oct 14 '22 02:10


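This works for the example data:

df.ffill().groupby('b').agg({'a': ['min', 'max', 'unique'], 'c': ['first', 'last', 'unique']})

Because you use min, max and unique, the repeated values created by the fill do not concern you. Note that the forward fill runs over the whole frame before grouping, so a NaN at the start of a group is filled with a value from the previous group.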

zipa answered Oct 14 '22 01:10


Update 23 November 2020

This answer is terrible; don't use it. Please refer to @IanS's answer.

Earlier

Try ffill

df.ffill().groupby('b').agg({'a': ['min', 'max', 'unique'], 'c': ['first', 'last', 'unique']})
      c                          a                 
  first last           unique  min  max      unique
b                                                  
0   foo  foo            [foo]  1.0  2.0  [1.0, 2.0]
1   bar  bar  [bar, foo, baz]  1.0  3.0  [1.0, 3.0]

If NaN is the first element of a group, the above solution breaks.
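
A minimal sketch of that failure mode, using hypothetical data (not from the question) in which the first row of group 1 is missing. It compares forward-filling before grouping with dropping NaN inside each group; the lambda returning a list is an illustrative choice, not code from any of the answers.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first row of group b == 1 has a missing 'a'.
df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 3.0],
    'b': [0, 0, 1, 1],
})

# Forward fill runs over the whole frame and ignores group boundaries,
# so the NaN in group 1 is replaced by 2.0 from group 0.
leaky = df.ffill().groupby('b')['a'].unique()

# Dropping NaN inside each group keeps only values that belong to it.
clean = df.groupby('b')['a'].agg(lambda s: s.dropna().unique().tolist())

print(leaky.loc[1].tolist())  # [2.0, 3.0]  (2.0 leaked in from group 0)
print(clean.loc[1])           # [3.0]
```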

Bharath answered Oct 14 '22 02:10