 

Pandas Dataframe groupby + agg + lambda + unique throwing a ValueError

I have a table that looks like this called rev_df.

       pcid     date        rep     rev    new_rev  diff    Period
0      523468   2017-01-01  1127    16.60   0       NaN     1
1      523468   2017-01-02  1127    41.32   0       1       1
2      523468   2017-01-03  4568    52.39   0       1       1
3      523468   2017-01-04  4568    47.31   0       1       2

This is the line of code in question that's causing some PROBLEMS™.

rev_df_period = rev_df.groupby(['pcid', 'Period']).agg({
    'date': [np.min, np.max],
    'rev': np.sum,
    'new_rev': np.sum,
    'rep': lambda x: x.unique(),  # this entry raises the ValueError
})

The lambda x: x.unique() is causing the following error:

ValueError: Function does not reduce

Through testing, I found that if I change the last agg lambda function to .nunique(), it doesn't throw an error. But I need the list of unique rep values, not the number of values.

Any ideas?

The output should look like this:

                new_rev        date              rev      rep
                sum     amin         amax        sum      unique
pcid    Period                      
523468  1       0       2017-01-01   2017-02-01  1026.94  [1127,4568]
        2       0       2017-03-24   2017-03-30  90.00    4568
Cassie Beth asked Feb 07 '26 04:02



1 Answer

You can try this, which returns a plain Python list of the unique values for each group instead of a NumPy array:

rev_df.groupby(['pcid', 'Period']).agg({
    'date': [np.min, np.max],
    'rev': np.sum,
    'new_rev': np.sum,
    'rep': lambda x: list(set(x)),  # unique reps as a list (order not guaranteed)
})

Output:

                     date                 rev new_rev           rep
                     amin        amax     sum     sum      <lambda>
pcid   Period                                                      
523468 1       2017-01-01  2017-01-03  110.31       0  [4568, 1127]
       2       2017-01-04  2017-01-04   47.31       0        [4568]
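
As far as I can tell, the difference comes down to what each function returns per group: nunique() gives a single number, unique() gives a NumPy array (which this agg call appears to reject as non-reducing), and list(set(x)) gives a plain Python list that gets stored as one object. A minimal sketch of those return types, using a made-up Series s standing in for one group's rep values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'rep' values of a single group
s = pd.Series([1127, 1127, 4568])

n_unique = s.nunique()    # a single integer count
as_array = s.unique()     # a NumPy array of the distinct values
as_list = list(set(s))    # a plain Python list (order not guaranteed)

print(n_unique, type(as_array).__name__, type(as_list).__name__)
```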

Edit: to get a proper column name instead of <lambda>, set the function's __name__:

f = lambda x: list(set(x))
f.__name__ = 'unique'

rev_df.groupby(['pcid', 'Period']).agg({
    'date': [np.min, np.max],
    'rev': np.sum,
    'new_rev': np.sum,
    'rep': f,
})

Output:

                     date                 rev new_rev           rep
                     amin        amax     sum     sum        unique
pcid   Period                                                      
523468 1       2017-01-01  2017-01-03  110.31       0  [4568, 1127]
       2       2017-01-04  2017-01-04   47.31       0        [4568]
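
If your pandas version supports named aggregation (0.25 or later, which is an assumption about your environment), you can also set the column names directly without touching __name__. This is a different technique from the one above, sketched here on the four sample rows from the question; the output column names (date_min, rep_unique, and so on) are made up for illustration:

```python
import pandas as pd

# The four sample rows from the question (the diff column is omitted)
rev_df = pd.DataFrame({
    "pcid": [523468, 523468, 523468, 523468],
    "date": pd.to_datetime(
        ["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04"]
    ),
    "rep": [1127, 1127, 4568, 4568],
    "rev": [16.60, 41.32, 52.39, 47.31],
    "new_rev": [0, 0, 0, 0],
    "Period": [1, 1, 1, 2],
})

# Named aggregation: output_name=(input_column, function)
result = rev_df.groupby(["pcid", "Period"]).agg(
    date_min=("date", "min"),
    date_max=("date", "max"),
    rev_sum=("rev", "sum"),
    new_rev_sum=("new_rev", "sum"),
    # Convert the array to a list so each group yields one object
    rep_unique=("rep", lambda s: s.unique().tolist()),
)

print(result)
```

Series.unique() keeps values in first-seen order, so the resulting lists are stable, unlike list(set(x)).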
Scott Boston answered Feb 09 '26 08:02
