
df.groupby(...).agg(set) produces different result compared to df.groupby(...).agg(lambda x: set(x))

While answering another question, I found that df.groupby(...).agg(set) and df.groupby(...).agg(lambda x: set(x)) produce different results.

Data:

import pandas as pd

df = pd.DataFrame({
       'user_id': [1, 2, 3, 4, 1, 2, 3],
       'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga',
                      'Ju-jitsu', 'Krav Maga', 'Karate'],
       'instructor': ['Bob', 'Alice', 'Bob', 'Alice', 'Alice', 'Alice', 'Bob']})

Demo:

In [36]: df.groupby('user_id').agg(lambda x: set(x))
Out[36]:
                    class_type    instructor
user_id
1        {Krav Maga, Ju-jitsu}  {Alice, Bob}
2            {Yoga, Krav Maga}       {Alice}
3           {Ju-jitsu, Karate}         {Bob}
4                  {Krav Maga}       {Alice}

In [37]: df.groupby('user_id').agg(set)
Out[37]:
                                class_type                         instructor
user_id
1        {user_id, class_type, instructor}  {user_id, class_type, instructor}
2        {user_id, class_type, instructor}  {user_id, class_type, instructor}
3        {user_id, class_type, instructor}  {user_id, class_type, instructor}
4        {user_id, class_type, instructor}  {user_id, class_type, instructor}

I would expect the same behaviour here - do you know what I am missing?

MaxU - stop WAR against UA asked Mar 28 '18 14:03



2 Answers

What is happening here is that set is not handled in _aggregate, because it does not pass the is_list_like check:

elif is_list_like(arg) and arg not in compat.string_types:

see source

Since set is not is_list_like, None is returned up the call chain, which ends up at this line:

results.append(colg.aggregate(a))

see source

This raises TypeError: 'type' object is not iterable
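One plausible reading of this error, in plain Python: the class set exposes an __iter__ attribute, yet the class itself cannot be iterated, whereas a lambda has no __iter__ at all. The sketch below assumes that the column-level aggregate looks for an __iter__ attribute to detect "a list of functions"; that is an assumption about pandas 0.22 internals, not something confirmed above.

```python
# Assumption: the dispatch checks for an __iter__ attribute to decide
# whether it was handed a list of functions. This block only shows the
# underlying Python behaviour, not the pandas code itself.
wrapped = lambda x: set(x)

print(hasattr(set, '__iter__'))      # the class exposes the method
print(hasattr(wrapped, '__iter__'))  # a lambda does not

try:
    iter(set)                        # iterating the class itself fails
    error = None
except TypeError as exc:
    error = exc

print(error)
```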

Because no column produced a result, this in turn raises:

if not len(results):
    raise ValueError("no results")

see source

So, because there are no results, we end up calling _aggregate_generic:

see source

this then calls:

result[name] = self._try_cast(func(data, *args, **kwargs), data)

see source

This then ends up as:

(Pdb) n
> c:\programdata\anaconda3\lib\site-packages\pandas\core\groupby.py(3779)_aggregate_generic()
-> return self._wrap_generic_output(result, obj)

(Pdb) result
{1: {'user_id', 'instructor', 'class_type'}, 2: {'user_id', 'instructor', 'class_type'}, 3: {'user_id', 'instructor', 'class_type'}, 4: {'user_id', 'instructor', 'class_type'}}
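The dictionary above can be reproduced without agg. This is a sketch of the effect rather than of the pandas internals; it assumes only that iterating over a grouped DataFrame yields (name, sub-DataFrame) pairs and that iterating over a DataFrame yields its column labels.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3],
    'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga',
                   'Ju-jitsu', 'Krav Maga', 'Karate'],
    'instructor': ['Bob', 'Alice', 'Bob', 'Alice', 'Alice', 'Alice', 'Bob']})

# set(group) iterates over the sub-DataFrame, i.e. over its column labels,
# so every group maps to the same set of column names.
result = {name: set(group) for name, group in df.groupby('user_id')}
print(result)
```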

I'm running a slightly different version of pandas but the equivalent source line is https://github.com/pandas-dev/pandas/blob/v0.22.0/pandas/core/groupby.py#L3779

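So essentially, because set is treated as neither a plain function nor a list of functions, the call collapses to invoking the set constructor on each group's sub-DataFrame. Iterating over a DataFrame yields its column labels, so every group produces the set of column names. You can see the same effect here: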

In [8]: df.groupby('user_id').agg(lambda x: print(set(x.columns)))
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
Out[8]: 
        class_type instructor
user_id                      
1             None       None
2             None       None
3             None       None
4             None       None

But when you use the lambda, which is an anonymous function, this works as expected.
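If the goal is simply the set of values per group, one way to avoid the ambiguity is to select a single column first, so the function can only ever receive that column's values. This is a minimal sketch of that alternative, not something proposed in the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3],
    'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga',
                   'Ju-jitsu', 'Krav Maga', 'Karate'],
    'instructor': ['Bob', 'Alice', 'Bob', 'Alice', 'Alice', 'Alice', 'Bob']})

# Selecting one column gives a grouped Series, so set() is built from
# that column's values in each group rather than from column labels.
classes = df.groupby('user_id')['class_type'].apply(set)
print(classes)
```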

EdChum answered Oct 20 '22 00:10


Perhaps, as @EdChum commented, agg applies Python builtin functions to each group as a mini DataFrame, whereas a user-defined function is applied to each column separately. This can be illustrated with print.

df.groupby('user_id').agg(print,end='\n\n')

  class_type instructor  user_id
0  Krav Maga        Bob        1
4   Ju-jitsu      Alice        1

  class_type instructor  user_id
1       Yoga      Alice        2
5  Krav Maga      Alice        2

  class_type instructor  user_id
2   Ju-jitsu        Bob        3
6     Karate        Bob        3


df.groupby('user_id').agg(lambda x : print(x,end='\n\n'))

0    Krav Maga
4     Ju-jitsu
Name: class_type, dtype: object

1         Yoga
5    Krav Maga
Name: class_type, dtype: object

2    Ju-jitsu
6      Karate
Name: class_type, dtype: object

3    Krav Maga
Name: class_type, dtype: object

...
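To check what your own pandas version hands to the function, the types can be recorded instead of printed. The helper record below is hypothetical and exists only for this illustration; the result depends on the installed pandas version, so no particular output is claimed.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3],
    'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga',
                   'Ju-jitsu', 'Krav Maga', 'Karate'],
    'instructor': ['Bob', 'Alice', 'Bob', 'Alice', 'Alice', 'Alice', 'Bob']})

seen = []  # type names of the objects passed to the aggregating function

def record(obj):
    # Hypothetical helper: remember the type, then return a scalar so
    # the call is still a valid aggregation.
    seen.append(type(obj).__name__)
    return len(obj)

df.groupby('user_id').agg(lambda x: record(x))

# 'Series' means per-column calls, 'DataFrame' means per-group calls.
print(sorted(set(seen)))
```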

This is likely why applying set directly gave the result shown in the question.

Bharath answered Oct 20 '22 00:10