While answering this question, it turned out that df.groupby(...).agg(set)
and df.groupby(...).agg(lambda x: set(x))
produce different results.
Data:
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3],
    'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga',
                   'Ju-jitsu', 'Krav Maga', 'Karate'],
    'instructor': ['Bob', 'Alice', 'Bob', 'Alice', 'Alice', 'Alice', 'Bob']})
Demo:
In [36]: df.groupby('user_id').agg(lambda x: set(x))
Out[36]:
class_type instructor
user_id
1 {Krav Maga, Ju-jitsu} {Alice, Bob}
2 {Yoga, Krav Maga} {Alice}
3 {Ju-jitsu, Karate} {Bob}
4 {Krav Maga} {Alice}
In [37]: df.groupby('user_id').agg(set)
Out[37]:
class_type instructor
user_id
1 {user_id, class_type, instructor} {user_id, class_type, instructor}
2 {user_id, class_type, instructor} {user_id, class_type, instructor}
3 {user_id, class_type, instructor} {user_id, class_type, instructor}
4 {user_id, class_type, instructor} {user_id, class_type, instructor}
I would expect the same behaviour here - do you know what I am missing?
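In the pandas version I tested, aggregating with set no longer results in TypeError: 'type' object is not iterable. I am not certain when this behaviour changed. The original difference arises because set is of type type, whereas a wrapper such as to_set is of type function, and the documentation describes the argument of agg as: "Function to use for aggregating groups."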
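To make the type/function distinction concrete, here is a minimal sketch. The name to_set is only a hypothetical wrapper chosen for illustration (its definition is an assumption, not taken from the thread), and the check uses nothing but the standard library:

```python
import types


def to_set(values):
    """Hypothetical wrapper: build a set from any iterable of values."""
    return set(values)


# The builtin `set` is a class, i.e. an instance of `type` ...
set_is_class = isinstance(set, type)

# ... whereas a def-function and a lambda are both plain function objects.
wrapper_is_function = isinstance(to_set, types.FunctionType)
lambda_is_function = isinstance(lambda x: set(x), types.FunctionType)

# All three objects are callable, so code that dispatches on the exact
# kind of callable it receives can treat them differently.
print(set_is_class, wrapper_is_function, lambda_is_function)
print(callable(set), callable(to_set))
```

Both objects can be called with an iterable and return the same set, so any difference in the groupby result has to come from how pandas dispatches on the object it was handed, not from what the object computes.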
OK, what is happening here is that set isn't being handled, as it's not is_list_like in _aggregate:
elif is_list_like(arg) and arg not in compat.string_types:
(see source). Because set isn't is_list_like, this returns None up the call chain and ends up at this line:
results.append(colg.aggregate(a))
(see source). This raises TypeError: 'type' object is not iterable, which then raises:
if not len(results):
raise ValueError("no results")
(see source). Because we have no results, we end up calling _aggregate_generic (see source), which then calls:
result[name] = self._try_cast(func(data, *args, **kwargs), data)
see source
This then ends up as:
(Pdb) n
> c:\programdata\anaconda3\lib\site-packages\pandas\core\groupby.py(3779)_aggregate_generic()
-> return self._wrap_generic_output(result, obj)
(Pdb) result
{1: {'user_id', 'instructor', 'class_type'}, 2: {'user_id', 'instructor', 'class_type'}, 3: {'user_id', 'instructor', 'class_type'}, 4: {'user_id', 'instructor', 'class_type'}}
I'm running a slightly different version of pandas but the equivalent source line is https://github.com/pandas-dev/pandas/blob/v0.22.0/pandas/core/groupby.py#L3779
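The result dictionary shown in the debugger can be reproduced by hand. The sketch below only imitates the loop in _aggregate_generic (it does not call any pandas internals) and assumes the example DataFrame from the question. It relies on the fact that iterating over a DataFrame yields its column labels:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3],
    'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga',
                   'Ju-jitsu', 'Krav Maga', 'Karate'],
    'instructor': ['Bob', 'Alice', 'Bob', 'Alice', 'Alice', 'Alice', 'Bob']})

# Iterating a groupby object yields (group key, sub-DataFrame) pairs.
# Calling set() on a DataFrame iterates over it, and iterating a
# DataFrame yields column labels, not rows or values.
per_group = {name: set(group) for name, group in df.groupby('user_id')}

for name, columns in per_group.items():
    print(name, sorted(columns))
```

Every group maps to the same set of column names, which matches the output of agg(set) in the question.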
So essentially, because set doesn't count as a function or an iterable here, it collapses to calling the constructor on each group's DataFrame. Iterating over a DataFrame yields its column labels, so every group ends up as the set of column names. You can see the same effect here:
In [8]:
df.groupby('user_id').agg(lambda x: print(set(x.columns)))
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
Out[8]:
class_type instructor
user_id
1 None None
2 None None
3 None None
4 None None
But when you use the lambda, which is an anonymous function, this works as expected.
Perhaps, as @Edchum commented, agg applies Python builtin functions to each group as a whole, treating the group as a mini DataFrame, whereas a user-defined function is applied to every column separately. An example to illustrate this uses print:
df.groupby('user_id').agg(print,end='\n\n')
class_type instructor user_id
0 Krav Maga Bob 1
4 Ju-jitsu Alice 1
class_type instructor user_id
1 Yoga Alice 2
5 Krav Maga Alice 2
class_type instructor user_id
2 Ju-jitsu Bob 3
6 Karate Bob 3
df.groupby('user_id').agg(lambda x : print(x,end='\n\n'))
0 Krav Maga
4 Ju-jitsu
Name: class_type, dtype: object
1 Yoga
5 Krav Maga
Name: class_type, dtype: object
2 Ju-jitsu
6 Karate
Name: class_type, dtype: object
3 Krav Maga
Name: class_type, dtype: object
...
I hope this explains why applying set gave the result shown above.
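If the goal is simply one set of values per column and group, a way to sidestep the ambiguity is to pass a real function instead of the set class. This is a minimal sketch, assuming the example DataFrame from the question and a reasonably recent pandas. The helper name to_set is hypothetical, and behaviour on the old version discussed above may differ:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3],
    'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga',
                   'Ju-jitsu', 'Krav Maga', 'Karate'],
    'instructor': ['Bob', 'Alice', 'Bob', 'Alice', 'Alice', 'Alice', 'Bob']})


def to_set(values):
    """Hypothetical helper: collect the values of one column of one group."""
    return set(values)


# A plain function is applied to each column of each group, so every
# cell of the result holds the set of values seen for that user.
result = df.groupby('user_id').agg(to_set)
print(result)
```

This is equivalent to the lambda x: set(x) form from the question. A named function just gives a more readable name in tracebacks.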