I don't understand which functions are acceptable for groupby + transform operations. Often, I end up just guessing, testing, and reverting until something works, but I feel there should be a systematic way of determining whether a solution will work.
Here's a minimal example. First, let's use groupby + apply with set:
import pandas as pd

df = pd.DataFrame({'a': [1,2,3,1,2,3,3], 'b': [1,2,3,1,2,3,3], 'type': [1,0,1,0,1,0,1]})
g = df.groupby(['a', 'b'])['type'].apply(set)
print(g)
a b
1 1 {0, 1}
2 2 {0, 1}
3 3 {0, 1}
This works fine, but I want the resulting set, calculated groupwise, in a new column of the original dataframe. So I try to use transform:
df['g'] = df.groupby(['a', 'b'])['type'].transform(set)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
---> 23 df['g'] = df.groupby(['a', 'b'])['type'].transform(set)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'set'
This is the error I see in Pandas v0.19.0. In v0.23.0, I see TypeError: 'set' type is unordered. Of course, I can map a specifically defined index to achieve my result:
g = df.groupby(['a', 'b'])['type'].apply(set)
df['g'] = df.set_index(['a', 'b']).index.map(g.get)
print(df)
a b type g
0 1 1 1 {0, 1}
1 2 2 0 {0, 1}
2 3 3 1 {0, 1}
3 1 1 0 {0, 1}
4 2 2 1 {0, 1}
5 3 3 0 {0, 1}
6 3 3 1 {0, 1}
But I thought the benefit of transform was to avoid such an explicit mapping. Where did I go wrong?
I believe, first of all, that there is some room for intuition in using these functions, as their names can be very meaningful.
In your first example, you are not actually trying to transform your values, but rather to aggregate them (and aggregation would work in the way you intended).
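For instance, calling agg instead produces exactly the groupwise reduction you got from apply (a minimal sketch; agg(set) is accepted in recent pandas versions, while older ones may need agg(lambda x: set(x))):

import pandas as pd

df = pd.DataFrame({'a': [1,2,3,1,2,3,3], 'b': [1,2,3,1,2,3,3], 'type': [1,0,1,0,1,0,1]})

# agg reduces each group to a single value -- here, the set of its 'type' values
g = df.groupby(['a', 'b'])['type'].agg(set)
print(g)
# a  b
# 1  1    {0, 1}
# 2  2    {0, 1}
# 3  3    {0, 1}
# Name: type, dtype: object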
But getting into code, the transform docs are quite suggestive in saying that it should
Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk.
When you do
df.groupby(['a', 'b'])['type'].transform(some_func)
you are actually transforming each pd.Series object from each group into a new object using your some_func function. But the thing is, this new object should have the same size as the group OR be broadcastable to the size of the chunk.
Therefore, if you transform your series using tuple or list, you will basically be transforming the object
0 1
1 2
2 3
dtype: int64
into
[1,2,3]
But notice that these values are now assigned back to their respective indexes, and that is why you see no difference in the transform operation. The row that had the .iloc[0] value from the pd.Series will now have the [1,2,3][0] value from the transformed list (the same applies to tuple), etc. Notice that ordering and size matter here, because otherwise you could mess up your groups and the transform wouldn't work (and this is exactly why set is not a proper function to be used in this case).
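You can see this positional reassignment directly (a quick sketch using the question's df; transform accepts any function whose result matches the group's length):

# list hands each group's values straight back: element i returns to row i
df['same'] = df.groupby(['a', 'b'])['type'].transform(list)

# a same-length but reordered result silently shuffles values within each group
df['shuffled'] = df.groupby(['a', 'b'])['type'].transform(lambda s: sorted(s))

print(df[['type', 'same', 'shuffled']])

The 'same' column reproduces 'type' exactly, while 'shuffled' shows the values rearranged within each group, which is why ordering matters.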
The second part of the quoted text says "broadcastable to the size of the group chunk".
This means that you can also transform your pd.Series into an object that can be used in all rows. For example,
df.groupby(['a', 'b'])['type'].transform(lambda k: 50)
would work. Why? Even though 50 is not iterable, it is broadcastable: the value is used repeatedly in all positions of your initial pd.Series.
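The same scalar broadcasting is what makes per-group reductions useful inside transform (a sketch; 'max' and sum are standard pandas reductions):

# each group's scalar result is broadcast back to every row of that group
df['group_max'] = df.groupby(['a', 'b'])['type'].transform('max')
df['group_sum'] = df.groupby(['a', 'b'])['type'].transform(lambda s: s.sum())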
Why can you apply using set?
Because the apply method doesn't have this size constraint on the result. It actually has three different result types, and it infers whether you want to expand, reduce, or broadcast your results. Notice that you cannot reduce when transforming.
By default (result_type=None), the final return type is inferred from the return type of the applied function.
result_type : {‘expand’, ‘reduce’, ‘broadcast’, None}, default None. These only act when axis=1 (columns):
‘expand’ : list-like results will be turned into columns.
‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
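In the groupby context, that flexibility means apply can either reduce each group or hand back a same-sized result (a sketch using the question's df; both are ordinary groupby.apply calls):

# reduce: one set per group, indexed by the group keys
print(df.groupby(['a', 'b'])['type'].apply(set))

# same-sized result: behaves much like transform, keeping the original index
print(df.groupby(['a', 'b'])['type'].apply(lambda s: s * 2))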
The result of the transformation is restricted to certain types. [For example, it can't be list, set, Series, etc. -- This is incorrect, thank you @RafaelC for the comment.] I don't think this is documented, but when examining the source code of groupby.py and series.py you can find those type restrictions.
From the groupby documentation:
The transform method returns an object that is indexed the same (same size) as the one being grouped. The transform function must:
Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (e.g., a scalar, grouped.transform(lambda x: x.iloc[-1])).
Operate column-by-column on the group chunk. The transform is applied to the first group chunk using chunk.apply.
Not perform in-place operations on the group chunk. Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results. For example, when using fillna, inplace must be False (grouped.transform(lambda x: x.fillna(inplace=False))).
(Optionally) operates on the entire group chunk. If this is supported, a fast path is used starting from the second chunk.
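As a concrete illustration of the non-in-place rule from that list (a sketch with a hypothetical frame df2 containing missing values):

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'key': ['x', 'x', 'y', 'y'], 'val': [1.0, np.nan, 3.0, np.nan]})

# fill each group's NaNs with that group's mean, returning a new Series
# rather than mutating the group chunk in place
df2['filled'] = df2.groupby('key')['val'].transform(lambda s: s.fillna(s.mean()))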
Disclaimer: I got a different error (pandas version 0.23.1):
df['g'] = df.groupby(['a', 'b'])['type'].transform(set)
File "***/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 3661, in transform
s = klass(res, indexer)
File "***/lib/python3.6/site-packages/pandas/core/series.py", line 242, in __init__
"".format(data.__class__.__name__))
TypeError: 'set' type is unordered
After transforming the group into a set, pandas can't broadcast it to the Series, because it is unordered (and has different dimensions than the group chunk). If we forced it into a list, its values would be assigned back one per row instead of placing the whole set in each row (and its length might not even match the group chunk's). The answer is to wrap it in some container, so that the resulting object has size 1, and then pandas will be able to broadcast it:
import numpy as np

df['g'] = df.groupby(['a', 'b'])['type'].transform(lambda x: np.array(set(x)))
print(df)
a b type g
0 1 1 1 {0, 1}
1 2 2 0 {0, 1}
2 3 3 1 {0, 1}
3 1 1 0 {0, 1}
4 2 2 1 {0, 1}
5 3 3 0 {0, 1}
6 3 3 1 {0, 1}
Why did I choose np.array as a container? Because series.py (lines 205-206) passes this type through without further checks. So I believe this behavior will be preserved in future versions.