I am using an aggregation function that I have used in my work for a long time now. The idea is that if the Series passed to the function is of length 1 (i.e. the group only has one observation), then that observation is returned. If the length of the Series passed is greater than one, then the observations are returned in a list.
This may seem odd to some, but this is not an XY problem: I have good reason for wanting to do this, and it is not relevant to this question.
This is the function that I have been using:
    def MakeList(x):
        """ This function is used to aggregate data that needs to be kept
        distinct within multi-day observations for later use and
        transformation. It makes a list of the data and if the list is of
        length 1 then there is only one line/day observation in that group
        so the single element of the list is returned. If the list is longer
        than one then there are multiple line/day observations and the list
        itself is returned."""
        L = x.tolist()
        if len(L) > 1:
            return L
        else:
            return L[0]
Now for some reason, with the current data set I am working on I get a ValueError stating that the function does not reduce. Here is some test data and the remaining steps I am using:
    import pandas as pd

    DF = pd.DataFrame({'date': ['2013-04-02', '2013-04-02', '2013-04-02',
                                '2013-04-02', '2013-04-02', '2013-04-02',
                                '2013-04-02', '2013-04-02', '2013-04-02',
                                '2013-04-02'],
                       'line_code': ['401101', '401101', '401102', '401103',
                                     '401104', '401105', '401105', '401106',
                                     '401106', '401107'],
                       's.m.v.': [ 7.760, 25.564, 25.564,  9.550,  4.870,
                                   7.760, 25.564,  5.282, 25.564,  5.282]})

    DFGrouped = DF.groupby(['date', 'line_code'], as_index = False)
    DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})
In trying to debug this, I put print statements in the function to the effect of print L and print x.index, and the output was as follows:

    [7.7599999999999998, 25.564]
    Int64Index([0, 1], dtype='int64')
    [7.7599999999999998, 25.564]
    Int64Index([0, 1], dtype='int64')
For some reason it appears that agg is passing the first Series to the function twice. As far as I know this is not normal at all, and it is presumably the reason why my function is not reducing.
For example if I write a function like this:
    def test_func(x):
        print(x.index)
        return x.iloc[0]
This runs without problem and the print statements are:
    DF_Agg = DFGrouped.agg({'s.m.v.' : test_func})

    Int64Index([0, 1], dtype='int64')
    Int64Index([2], dtype='int64')
    Int64Index([3], dtype='int64')
    Int64Index([4], dtype='int64')
    Int64Index([5, 6], dtype='int64')
    Int64Index([7, 8], dtype='int64')
    Int64Index([9], dtype='int64')
Which indicates that each group is only being passed once as a Series to the function.
Can anyone help me understand why this is failing? I have used this function successfully on many of the data sets I work with.
Thanks
I can't really explain why, but in my experience a list inside a pandas.DataFrame doesn't work all that well.
I usually use a tuple instead. This will work:
    def MakeList(x):
        T = tuple(x)
        if len(T) > 1:
            return T
        else:
            return T[0]

    DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})

             date line_code           s.m.v.
    0  2013-04-02    401101   (7.76, 25.564)
    1  2013-04-02    401102           25.564
    2  2013-04-02    401103             9.55
    3  2013-04-02    401104             4.87
    4  2013-04-02    401105   (7.76, 25.564)
    5  2013-04-02    401106  (5.282, 25.564)
    6  2013-04-02    401107            5.282
This is a misfeature in DataFrame. If the aggregator returns a list for the first group, it will fail with the error you mention; if it returns a non-list (non-Series) for the first group, it will work fine. The broken code is in groupby.py:
    def _aggregate_series_pure_python(self, obj, func):

        group_index, _, ngroups = self.group_info

        counts = np.zeros(ngroups, dtype=int)
        result = None

        splitter = get_splitter(obj, group_index, ngroups, axis=self.axis)

        for label, group in splitter:
            res = func(group)
            if result is None:
                if (isinstance(res, (Series, Index, np.ndarray)) or
                        isinstance(res, list)):
                    raise ValueError('Function does not reduce')
                result = np.empty(ngroups, dtype='O')
            counts[label] = group.shape[0]
            result[label] = res
Notice the if result is None and isinstance(res, list) checks: the type test only runs while result is still None, that is, for the first group. Your options are:
Fake out groupby().agg(), so it doesn't see a list for the first group, or
Do the aggregation yourself, using code like that above but without the erroneous test.
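For the second option, here is a minimal sketch of doing the aggregation by hand with a plain loop over the groups, so no reduction check is involved. It reuses the test data from the question; the names make_list, rows and df_agg are my own, and this is just one possible way to do it, not what agg does internally.

```python
import pandas as pd


def make_list(x):
    """Return the single value for one-row groups, else a list of values."""
    values = x.tolist()
    return values if len(values) > 1 else values[0]


# Test data from the question.
df = pd.DataFrame({
    'date': ['2013-04-02'] * 10,
    'line_code': ['401101', '401101', '401102', '401103', '401104',
                  '401105', '401105', '401106', '401106', '401107'],
    's.m.v.': [7.760, 25.564, 25.564, 9.550, 4.870,
               7.760, 25.564, 5.282, 25.564, 5.282],
})

# Iterate over the groups directly instead of calling agg, and build
# one output row per group.
rows = []
for (date, line_code), group in df.groupby(['date', 'line_code']):
    rows.append({'date': date,
                 'line_code': line_code,
                 's.m.v.': make_list(group['s.m.v.'])})

df_agg = pd.DataFrame(rows, columns=['date', 'line_code', 's.m.v.'])
print(df_agg)
```

The result column has object dtype, holding a list for multi-row groups and a plain float otherwise. The loop will be slower than a built-in aggregation on large data, so treat it as a fallback.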