Came across this seemingly odd behaviour while discussing https://stackoverflow.com/a/47543066/9017455.
The OP had this dataframe:
import pandas as pd

x = pd.DataFrame.from_dict({
    'cat1': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'cat2': ['X', 'X', 'Y', 'Y', 'Y', 'Y', 'Z', 'Z']})
and wanted to find the unique cat2 values for each group of cat1 values.
One option is to aggregate and use a lambda to create a set of unique values:
x.groupby('cat1').agg(lambda x: set(x))
# Returns
cat2
cat1
A {X, Y}
B {Y}
C {Z, Y}
I assumed that passing set on its own would be equivalent to the lambda here, since it is callable. However:
x.groupby('cat1').agg(set)
# Returns
cat2
cat1
A {cat1, cat2}
B {cat1, cat2}
C {cat1, cat2}
I get the same behaviour as the lambda method if I define a proper function, and by doing that I can see that pandas calls that function with a Series. It appears that set is being called with a DataFrame, hence it returns the set of column names when iterating over the object.
This seems like inconsistent behaviour. Can anyone shed some light on why Pandas treats the builtin functions differently?
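To illustrate the difference described above, here is a minimal sketch that calls set directly on one group, outside of groupby. The name group_a is only for illustration. Iterating over a DataFrame yields its column labels, while iterating over a Series yields its values:

```python
import pandas as pd

x = pd.DataFrame.from_dict({
    'cat1': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'cat2': ['X', 'X', 'Y', 'Y', 'Y', 'Y', 'Z', 'Z']})

# Rows of the 'A' group, still a DataFrame with both columns.
group_a = x[x['cat1'] == 'A']

# Iterating over a DataFrame walks its column labels.
labels = set(group_a)

# Iterating over a single column (a Series) walks its values.
values = set(group_a['cat2'])

print(labels)
print(values)
```

So if set receives the whole sub-frame rather than one column, the column names are what end up in the result.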
Looking at how SeriesGroupBy.agg behaves might provide some more insight. Passing any type to this function results in the error "TypeError: 'type' object is not iterable":
x.groupby('cat1')['cat2'].agg(set)
This behaviour seems to have changed by now. At least here in version 0.23.0, both lambda x: set(x) and set behave identically:
In [6]: x.groupby('cat1').agg(set)
Out[6]:
cat2
cat1
A {Y, X}
B {Y}
C {Y, Z}
In [7]: x.groupby('cat1').agg(lambda x: set(x))
Out[7]:
cat2
cat1
A {Y, X}
B {Y}
C {Y, Z}
I could not positively identify the change, but bug #16405 looks suspiciously related (although the fix was already released with 0.20.2 in June 2017, long before this question...).
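If you would rather not depend on how a particular version dispatches builtins, one option is to select the column explicitly and keep the lambda, so the function only ever sees a Series. This is a sketch of that workaround, not something taken from the pandas changelog, and the name unique_per_group is just illustrative:

```python
import pandas as pd

x = pd.DataFrame.from_dict({
    'cat1': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'cat2': ['X', 'X', 'Y', 'Y', 'Y', 'Y', 'Z', 'Z']})

# Select 'cat2' first so each group passed to the lambda is a Series,
# then wrap set in a lambda instead of passing the builtin type itself.
unique_per_group = x.groupby('cat1')['cat2'].agg(lambda s: set(s))

print(unique_per_group)
```

The result is a Series indexed by cat1 whose values are Python sets, so the order of elements inside each set is not guaranteed.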