df.groupby(...).agg(set) produces different result compared to df.groupby(...).agg(lambda x: set(x))

Tags: python, pandas, pandas-groupby

While answering this question, it turned out that df.groupby(...).agg(set) and df.groupby(...).agg(lambda x: set(x)) produce different results.

Data:

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3],
    'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga',
                   'Ju-jitsu', 'Krav Maga', 'Karate'],
    'instructor': ['Bob', 'Alice', 'Bob', 'Alice', 'Alice', 'Alice', 'Bob']})

Demo:

In [36]: df.groupby('user_id').agg(lambda x: set(x))
Out[36]:
                    class_type    instructor
user_id
1        {Krav Maga, Ju-jitsu}  {Alice, Bob}
2            {Yoga, Krav Maga}       {Alice}
3           {Ju-jitsu, Karate}         {Bob}
4                  {Krav Maga}       {Alice}

In [37]: df.groupby('user_id').agg(set)
Out[37]:
                                class_type                         instructor
user_id
1        {user_id, class_type, instructor}  {user_id, class_type, instructor}
2        {user_id, class_type, instructor}  {user_id, class_type, instructor}
3        {user_id, class_type, instructor}  {user_id, class_type, instructor}
4        {user_id, class_type, instructor}  {user_id, class_type, instructor}

I would expect the same behaviour here - do you know what I am missing?


asked Mar 28 '18 14:03

MaxU - stop WAR against UA

2 Answers

OK, what is happening here is that set is not handled the way you expect, because it is not is_list_like in _aggregate:

elif is_list_like(arg) and arg not in compat.string_types:

see source

Because set isn't is_list_like, None is returned up the call chain, and we end up at this line:

results.append(colg.aggregate(a))

see source

This raises TypeError: 'type' object is not iterable

which in turn triggers:

if not len(results):
    raise ValueError("no results")

see source
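The TypeError message quoted above can be reproduced without pandas at all. This is only a minimal sketch of the underlying Python behaviour, not the pandas code path itself: asking for an iterator over the type object set fails, while an actual set instance iterates fine. The variable names are made up for illustration.

```python
# Iterating over the type object `set` (not an instance of it) fails.
try:
    iter(set)
    message = None
except TypeError as exc:
    message = str(exc)

print(message)

# An actual set instance iterates without complaint.
values = sorted(set(['Bob', 'Alice', 'Bob']))
print(values)
```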

Because there are no results, we end up calling _aggregate_generic:

see source

this then calls:

result[name] = self._try_cast(func(data, *args, **kwargs), data)

see source

This then ends up as:

(Pdb) n
> c:\programdata\anaconda3\lib\site-packages\pandas\core\groupby.py(3779)_aggregate_generic()
-> return self._wrap_generic_output(result, obj)

(Pdb) result
{1: {'user_id', 'instructor', 'class_type'}, 2: {'user_id', 'instructor', 'class_type'}, 3: {'user_id', 'instructor', 'class_type'}, 4: {'user_id', 'instructor', 'class_type'}}

I'm running a slightly different version of pandas but the equivalent source line is https://github.com/pandas-dev/pandas/blob/v0.22.0/pandas/core/groupby.py#L3779

So essentially, because set doesn't count as a function or an iterable here, it collapses to calling the constructor on the data for each group. Iterating over that data yields its column labels, so you get a set of column names. You can see the same effect here:

In [8]:

df.groupby('user_id').agg(lambda x: print(set(x.columns)))
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
Out[8]: 
        class_type instructor
user_id                      
1             None       None
2             None       None
3             None       None
4             None       None

When you pass the lambda, which is an anonymous function, this works as expected.
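The reason a set of column names shows up can be checked in isolation. As a small sketch (using a cut-down two-row frame that is not part of the original question), calling set on a DataFrame iterates over its column labels, whereas calling set on a single column iterates over the values:

```python
import pandas as pd

# A cut-down frame holding only the rows for user_id == 1.
df = pd.DataFrame({
    'user_id': [1, 1],
    'class_type': ['Krav Maga', 'Ju-jitsu'],
    'instructor': ['Bob', 'Alice']})

# Iterating over a DataFrame yields its column labels.
frame_as_set = set(df)

# Iterating over a Series yields its values.
series_as_set = set(df['instructor'])

print(frame_as_set)
print(series_as_set)
```

So whichever object the aggregation function receives decides the outcome: a whole frame gives column names, a single column gives that column's values.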


answered Oct 20 '22 00:10

EdChum

Perhaps, as @EdChum commented, agg applies Python builtin functions to each group as a whole, treating the group as a mini DataFrame, whereas a user-defined function is applied to every column separately. One way to illustrate this is with print.

df.groupby('user_id').agg(print,end='\n\n')

 class_type instructor  user_id
0  Krav Maga        Bob        1
4   Ju-jitsu      Alice        1

  class_type instructor  user_id
1       Yoga      Alice        2
5  Krav Maga      Alice        2

  class_type instructor  user_id
2   Ju-jitsu        Bob        3
6     Karate        Bob        3


df.groupby('user_id').agg(lambda x : print(x,end='\n\n'))

0    Krav Maga
4     Ju-jitsu
Name: class_type, dtype: object

1         Yoga
5    Krav Maga
Name: class_type, dtype: object

2    Ju-jitsu
6      Karate
Name: class_type, dtype: object

3    Krav Maga
Name: class_type, dtype: object

...

I hope this explains why applying set gave the result shown in the question.
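To make the frame-wise versus column-wise distinction concrete without depending on how a particular pandas version dispatches agg, here is a rough sketch that loops over the groups by hand. The per_group dict and its 'whole_frame' key are just illustrative names, not anything pandas provides.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3],
    'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga',
                   'Ju-jitsu', 'Krav Maga', 'Karate'],
    'instructor': ['Bob', 'Alice', 'Bob', 'Alice', 'Alice', 'Alice', 'Bob']})

per_group = {}
for user_id, frame in df.groupby('user_id'):
    per_group[user_id] = {
        # set() on the whole group frame: the column labels.
        'whole_frame': set(frame),
        # set() on one column at a time: the values in that column.
        'class_type': set(frame['class_type']),
        'instructor': set(frame['instructor']),
    }

print(per_group[1])
```

The 'whole_frame' entry matches the surprising output from agg(set) in the question, and the per-column entries match the output from the lambda version.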

answered Oct 20 '22 00:10

Bharath
