
Why does pandas groupby().transform() require a unique index?




I want to use groupby().transform() to do a custom (cumulative) transform of each block of records in a (sorted) dataset. Unless I ensure I have a unique key, it doesn't work. Why?

Here's a toy example:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1], [1, 2], [2, 3], [3, 4], [3, 5]],
                  columns='a b'.split())
df['partials'] = df.groupby('a')['b'].transform(np.cumsum)

gives the expected:

     a   b   partials
0    1   1   1
1    1   2   3
2    2   3   3
3    3   4   4
4    3   5   9

but if 'a' is the index (and therefore not unique), it all goes wrong:

df = df.set_index('a')
df['partials'] = df.groupby(level=0)['b'].transform(np.cumsum)

Exception                                 Traceback (most recent call last)
<ipython-input-146-d0c35a4ba053> in <module>()
      4 df = df.set_index('a')
----> 5 df.groupby(level=0)['b'].transform(np.cumsum)

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/groupby.pyc in transform(self, func, *args, **kwargs)
   1542             res = wrapper(group)
   1543             # result[group.index] = res
-> 1544             indexer = self.obj.index.get_indexer(group.index)
   1545             np.put(result, indexer, res)

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.pyc in get_indexer(self, target, method, limit)
    848         if not self.is_unique:
--> 849             raise Exception('Reindexing only valid with uniquely valued Index '
    850                             'objects')

Exception: Reindexing only valid with uniquely valued Index objects

Same error if you select column 'b' before grouping, i.e.


but you can make it work if you transform the entire dataframe, like:


or even a one-column dataframe (rather than series):


I feel like there's still some deep part of GroupBy-fu that I'm missing. Can someone set me straight?
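The snippets for the three variants described above did not survive in this copy of the question. A plausible reconstruction from the prose, on the same toy data, is sketched below; the exact calls are an assumption, not the asker's original code. On a pandas release that includes the fix, all three should run and agree.

```python
import numpy as np
import pandas as pd

# Toy data from the question, with the non-unique column 'a' as the index.
df = pd.DataFrame([[1, 1], [1, 2], [2, 3], [3, 4], [3, 5]],
                  columns='a b'.split()).set_index('a')

# Assumed variant 1: select column 'b' first, then group (reported to fail).
series_first = df['b'].groupby(level=0).transform(np.cumsum)

# Assumed variant 2: transform the entire DataFrame (reported to work).
whole_frame = df.groupby(level=0).transform(np.cumsum)

# Assumed variant 3: a one-column DataFrame instead of a Series (reported to work).
one_column = df.groupby(level=0)[['b']].transform(np.cumsum)
```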

patricksurry asked May 01 '13 02:05


1 Answer

This was a bug, since fixed in pandas (certainly in 0.15.2, IIRC it was fixed in 0.14), so you should no longer see this exception.
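A quick way to check whether an installed pandas includes the fix is to rerun the failing call on the toy data. This is a minimal sketch: the string alias 'cumsum' stands in for np.cumsum, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3, 3], 'b': [1, 2, 3, 4, 5]}).set_index('a')

# The index is not unique, which is the condition get_indexer rejected
# in the traceback from the question.
print(df.index.is_unique)  # False

# On a fixed pandas this returns a Series aligned with df instead of raising.
partials = df.groupby(level=0)['b'].transform('cumsum')
print(partials.tolist())  # [1, 3, 3, 4, 9]
```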

As a workaround, in earlier pandas you can use apply:

In [10]: g = df.groupby(level=0)['b']

In [11]: g.apply(np.cumsum)
1    1
1    3
2    3
3    4
3    9
dtype: int64

and you can assign this to a column in df:

In [12]: df['partial'] = g.apply(np.cumsum)
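
For a plain cumulative sum, the groupby object's own cumsum method is another option that avoids both transform and apply. This is a sketch on the question's toy data, not something the answer itself proposes. The result keeps the original row order and index, so it can be assigned straight back.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3, 3], 'b': [1, 2, 3, 4, 5]}).set_index('a')

# GroupBy.cumsum returns one value per input row, indexed like df.
df['partial'] = df.groupby(level=0)['b'].cumsum()
print(df['partial'].tolist())  # [1, 3, 3, 4, 9]
```

A genuinely custom cumulative function has no built-in equivalent, so transform or apply is still the route for that case.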
Andy Hayden answered Oct 26 '22 07:10
