
Pandas Groupby Agg Function Does Not Reduce


I am using an aggregation function that I have relied on in my work for a long time. The idea is that if the Series passed to the function has length 1 (i.e. the group has only one observation), that observation is returned. If the Series has more than one element, the observations are returned as a list.

This may seem odd to some, but it is not an XY problem: I have a good reason for wanting to do this, and it is not relevant to the question.

This is the function that I have been using:

    def MakeList(x):
        """ This function is used to aggregate data that needs to be kept distinct within multi day
            observations for later use and transformation. It makes a list of the data and if the list is of length 1
            then there is only one line/day observation in that group so the single element of the list is returned.
            If the list is longer than one then there are multiple line/day observations and the list itself is
            returned."""
        L = x.tolist()
        if len(L) > 1:
            return L
        else:
            return L[0]

Now for some reason, with the current data set I am working on I get a ValueError stating that the function does not reduce. Here is some test data and the remaining steps I am using:

    import pandas as pd

    DF = pd.DataFrame({
        'date': ['2013-04-02', '2013-04-02', '2013-04-02', '2013-04-02', '2013-04-02',
                 '2013-04-02', '2013-04-02', '2013-04-02', '2013-04-02', '2013-04-02'],
        'line_code': ['401101', '401101', '401102', '401103', '401104',
                      '401105', '401105', '401106', '401106', '401107'],
        's.m.v.': [7.760, 25.564, 25.564, 9.550, 4.870,
                   7.760, 25.564, 5.282, 25.564, 5.282],
    })

    DFGrouped = DF.groupby(['date', 'line_code'], as_index=False)
    DF_Agg = DFGrouped.agg({'s.m.v.': MakeList})

In trying to debug this, I put a print statement to the effect of print L and print x.index and the output was as follows:

    [7.7599999999999998, 25.564]
    Int64Index([0, 1], dtype='int64')
    [7.7599999999999998, 25.564]
    Int64Index([0, 1], dtype='int64')

For some reason it appears that agg is passing the Series twice to the function. This as far as I know is not normal at all, and is presumably the reason why my function is not reducing.

For example if I write a function like this:

    def test_func(x):
        print(x.index)
        return x.iloc[0]

This runs without problem and the print statements are:

    DF_Agg = DFGrouped.agg({'s.m.v.': test_func})

    Int64Index([0, 1], dtype='int64')
    Int64Index([2], dtype='int64')
    Int64Index([3], dtype='int64')
    Int64Index([4], dtype='int64')
    Int64Index([5, 6], dtype='int64')
    Int64Index([7, 8], dtype='int64')
    Int64Index([9], dtype='int64')

Which indicates that each group is only being passed once as a Series to the function.

Can anyone help me understand why this is failing? I have used this function successfully on many of the data sets I work with.

Thanks

Woody Pride asked Dec 12 '14 07:12



2 Answers

I can't really explain why, but in my experience lists inside a pandas.DataFrame don't work all that well.

I usually use a tuple instead. This will work:

    def MakeList(x):
        T = tuple(x)
        if len(T) > 1:
            return T
        else:
            return T[0]

    DF_Agg = DFGrouped.agg({'s.m.v.': MakeList})

             date line_code           s.m.v.
    0  2013-04-02    401101   (7.76, 25.564)
    1  2013-04-02    401102           25.564
    2  2013-04-02    401103             9.55
    3  2013-04-02    401104             4.87
    4  2013-04-02    401105   (7.76, 25.564)
    5  2013-04-02    401106  (5.282, 25.564)
    6  2013-04-02    401107            5.282
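If real lists are needed downstream, the tuples can be turned back into lists once the aggregation is done. A minimal sketch, assuming the sample data from the question; the helper names to_tuple_or_scalar and to_list_or_scalar are made up for illustration, and this has not been checked against every pandas version:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'date': ['2013-04-02'] * 10,
    'line_code': ['401101', '401101', '401102', '401103', '401104',
                  '401105', '401105', '401106', '401106', '401107'],
    's.m.v.': [7.760, 25.564, 25.564, 9.550, 4.870,
               7.760, 25.564, 5.282, 25.564, 5.282],
})


def to_tuple_or_scalar(x):
    # Hypothetical helper: same idea as MakeList, but returns a tuple.
    t = tuple(x)
    return t if len(t) > 1 else t[0]


def to_list_or_scalar(value):
    # Hypothetical helper: tuples become lists, scalars pass through.
    return list(value) if isinstance(value, tuple) else value


# Aggregate to tuples first, then convert outside of agg().
tuple_agg = (df.groupby(['date', 'line_code'])['s.m.v.']
               .agg(to_tuple_or_scalar)
               .reset_index())
tuple_agg['s.m.v.'] = tuple_agg['s.m.v.'].map(to_list_or_scalar)
```

The conversion happens after agg() has returned, so the reduction check never sees a list.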
paulo.filip3 answered Jan 09 '23 23:01



This is a misfeature in DataFrame. If the aggregator returns a list for the first group, it will fail with the error you mention; if it returns a non-list (non-Series) for the first group, it will work fine. The broken code is in groupby.py:

    def _aggregate_series_pure_python(self, obj, func):

        group_index, _, ngroups = self.group_info

        counts = np.zeros(ngroups, dtype=int)
        result = None

        splitter = get_splitter(obj, group_index, ngroups, axis=self.axis)

        for label, group in splitter:
            res = func(group)
            if result is None:
                if (isinstance(res, (Series, Index, np.ndarray)) or
                        isinstance(res, list)):
                    raise ValueError('Function does not reduce')
                result = np.empty(ngroups, dtype='O')

            counts[label] = group.shape[0]
            result[label] = res

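Notice the nested test: while result is None (that is, on the first group only), a return value that is a list, Series, Index or ndarray raises ValueError('Function does not reduce'). Later groups are never checked. Your options are: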

  1. Fake out groupby().agg(), so it doesn't see a list for the first group, or

  2. Do the aggregation yourself, using code like that above but without the erroneous test.
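Option 2 does not have to mean copying pandas internals. A rough sketch, assuming the sample data from the question: loop over the groups, call the function on each one, and build the result frame by hand. Here make_list is simply a renamed copy of the question's MakeList, and manual_agg is an illustrative name.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'date': ['2013-04-02'] * 10,
    'line_code': ['401101', '401101', '401102', '401103', '401104',
                  '401105', '401105', '401106', '401106', '401107'],
    's.m.v.': [7.760, 25.564, 25.564, 9.550, 4.870,
               7.760, 25.564, 5.282, 25.564, 5.282],
})


def make_list(x):
    # Return the single value for one-row groups, otherwise a list.
    values = x.tolist()
    return values if len(values) > 1 else values[0]


# Iterate over the groups directly, so no reduction check is involved.
rows = []
for (date, line_code), group in df.groupby(['date', 'line_code']):
    rows.append({
        'date': date,
        'line_code': line_code,
        's.m.v.': make_list(group['s.m.v.']),
    })

manual_agg = pd.DataFrame(rows, columns=['date', 'line_code', 's.m.v.'])
```

This is slower than agg() on large data because the loop runs in Python, but the function is free to return whatever it likes for any group.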

Nik Bates-Haus answered Jan 10 '23 00:01
