I don't understand which functions are acceptable for groupby + transform operations. Often, I end up just guessing, testing, and reverting until something works, but I feel there should be a systematic way of determining whether a solution will work.
Here's a minimal example. First, let's use groupby + apply with set:
import pandas as pd

df = pd.DataFrame({'a': [1,2,3,1,2,3,3], 'b': [1,2,3,1,2,3,3], 'type': [1,0,1,0,1,0,1]})
g = df.groupby(['a', 'b'])['type'].apply(set)
print(g)
a b
1 1 {0, 1}
2 2 {0, 1}
3 3 {0, 1}
This works fine, but I want the resulting set, calculated groupwise, in a new column of the original dataframe. So I try to use transform:
df['g'] = df.groupby(['a', 'b'])['type'].transform(set)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
---> 23 df['g'] = df.groupby(['a', 'b'])['type'].transform(set)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'set'
This is the error I see in Pandas v0.19.0. In v0.23.0, I see TypeError: 'set' type is unordered. Of course, I can map a specifically defined index to achieve my result:
g = df.groupby(['a', 'b'])['type'].apply(set)
df['g'] = df.set_index(['a', 'b']).index.map(g.get)
print(df)
a b type g
0 1 1 1 {0, 1}
1 2 2 0 {0, 1}
2 3 3 1 {0, 1}
3 1 1 0 {0, 1}
4 2 2 1 {0, 1}
5 3 3 0 {0, 1}
6 3 3 1 {0, 1}
But I thought the benefit of transform was to avoid such an explicit mapping. Where did I go wrong?
I believe, first of all, that there is some room for intuition in using these functions, as their names can be very meaningful.
In your first example, you are not actually trying to transform your values, but rather to aggregate them (and aggregation would work in the way you intended).
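For instance, calling agg instead produces exactly the groupwise reduction you got from apply (a minimal sketch; agg(set) is accepted in recent pandas versions, while older ones may need agg(lambda x: set(x))):

import pandas as pd

df = pd.DataFrame({'a': [1,2,3,1,2,3,3], 'b': [1,2,3,1,2,3,3], 'type': [1,0,1,0,1,0,1]})

# agg reduces each group to a single value -- here, the set of its 'type' values
g = df.groupby(['a', 'b'])['type'].agg(set)
print(g)
# a  b
# 1  1    {0, 1}
# 2  2    {0, 1}
# 3  3    {0, 1}
# Name: type, dtype: object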
But getting into code, the transform docs are quite suggestive in saying that it should
Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk.
When you do
df.groupby(['a', 'b'])['type'].transform(some_func)
you are actually transforming each pd.Series object from each group into a new object using your some_func function. But the thing is, this new object should have the same size as the group OR be broadcastable to the size of the chunk.
Therefore, if you transform your series using tuple or list, you will basically be transforming the object
0 1
1 2
2 3
dtype: int64
into
[1,2,3]
But notice that these values are now assigned back to their respective indexes, and that is why you see no difference in the transform operation. The row that had the .iloc[0] value from the pd.Series will now have the [1,2,3][0] value from the transformed list (the same applies to tuple), etc. Notice that ordering and size matter here, because otherwise you could mess up your groups and the transform wouldn't work (and this is exactly why set is not a proper function to be used in this case).
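You can see this positional reassignment directly (a quick sketch using the question's df; transform accepts any function whose result matches the group's length):

# list hands each group's values straight back: element i returns to row i
df['same'] = df.groupby(['a', 'b'])['type'].transform(list)

# a same-length but reordered result silently shuffles values within each group
df['shuffled'] = df.groupby(['a', 'b'])['type'].transform(lambda s: sorted(s))

print(df[['type', 'same', 'shuffled']])

The 'same' column reproduces 'type' exactly, while 'shuffled' shows the values rearranged within each group, which is why ordering matters.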
The second part of the quoted text says "broadcastable to the size of the group chunk".
This means that you can also transform your pd.Series into an object that can be used in all rows. For example,
df.groupby(['a', 'b'])['type'].transform(lambda k: 50)
would work. Why? Even though 50 is not iterable, it is broadcastable: the value is used repeatedly in all positions of your initial pd.Series.
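The same scalar broadcasting is what makes per-group reductions useful inside transform (a sketch; 'max' and sum are standard pandas reductions):

# each group's scalar result is broadcast back to every row of that group
df['group_max'] = df.groupby(['a', 'b'])['type'].transform('max')
df['group_sum'] = df.groupby(['a', 'b'])['type'].transform(lambda s: s.sum())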
Why can you apply using set?
Because the apply method doesn't have this size constraint on the result. It actually has three different result types, and it infers whether you want to expand, reduce, or broadcast your results. Notice that you cannot reduce when transforming.
By default (result_type=None), the final return type is inferred from the return type of the applied function.
result_type : {‘expand’, ‘reduce’, ‘broadcast’, None}, default None. These only act when axis=1 (columns):
‘expand’ : list-like results will be turned into columns.
‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
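In the groupby context, that flexibility means apply can either reduce each group or hand back a same-sized result (a sketch using the question's df; both are ordinary groupby.apply calls):

# reduce: one set per group, indexed by the group keys
print(df.groupby(['a', 'b'])['type'].apply(set))

# same-sized result: behaves much like transform, keeping the original index
print(df.groupby(['a', 'b'])['type'].apply(lambda s: s * 2))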
The result of the transformation is restricted to certain types. [For example, it can't be list, set, Series, etc. -- This is incorrect, thank you @RafaelC for the comment.] I don't think this is documented, but when examining the source code of groupby.py and series.py you can find those type restrictions.
From the groupby documentation:
The transform method returns an object that is indexed the same (same size) as the one being grouped. The transform function must:
Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (e.g., a scalar, grouped.transform(lambda x: x.iloc[-1])).
Operate column-by-column on the group chunk. The transform is applied to the first group chunk using chunk.apply.
Not perform in-place operations on the group chunk. Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results. For example, when using fillna, inplace must be False (grouped.transform(lambda x: x.fillna(inplace=False))).
(Optionally) operates on the entire group chunk. If this is supported, a fast path is used starting from the second chunk.
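As a concrete illustration of the non-in-place rule from that list (a sketch with a hypothetical frame df2 containing missing values):

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'key': ['x', 'x', 'y', 'y'], 'val': [1.0, np.nan, 3.0, np.nan]})

# fill each group's NaNs with that group's mean, returning a new Series
# rather than mutating the group chunk in place
df2['filled'] = df2.groupby('key')['val'].transform(lambda s: s.fillna(s.mean()))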
Disclaimer: I got a different error (pandas version 0.23.1):
df['g'] = df.groupby(['a', 'b'])['type'].transform(set)
File "***/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 3661, in transform
s = klass(res, indexer)
File "***/lib/python3.6/site-packages/pandas/core/series.py", line 242, in __init__
"".format(data.__class__.__name__))
TypeError: 'set' type is unordered
After transforming the group into a set, pandas can't broadcast it to the Series, because it is unordered (and has different dimensions than the group chunk). If we forced it into a list, its values would be assigned back one per row instead of placing the whole set in each row (and its length might not even match the group chunk's). The answer is to wrap it in some container, so that the resulting object has size 1, and then pandas will be able to broadcast it:
import numpy as np

df['g'] = df.groupby(['a', 'b'])['type'].transform(lambda x: np.array(set(x)))
print(df)
a b type g
0 1 1 1 {0, 1}
1 2 2 0 {0, 1}
2 3 3 1 {0, 1}
3 1 1 0 {0, 1}
4 2 2 1 {0, 1}
5 3 3 0 {0, 1}
6 3 3 1 {0, 1}
Why did I choose np.array as a container? Because series.py (lines 205-206) passes this type through without further checks. So I believe this behavior will be preserved in future versions.