Is pandas.DataFrame.groupby Guaranteed To Be Stable?

Tags: python, pandas, language-lawyer, group-by

I've noticed several uses of pd.DataFrame.groupby followed by an apply that implicitly assume groupby is stable: that is, if a and b are instances of the same group and a appeared before b prior to grouping, then a will appear before b after the grouping as well.

I think several answers clearly rely on this implicitly; to be concrete, here is one using groupby+cumsum.

Is there anything actually promising this behavior? The documentation only states:

Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.

Also, since pandas has indices, the same functionality could theoretically be achieved without this guarantee (albeit in a more cumbersome way).
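To make the assumption concrete, here is a minimal sketch of the kind of groupby+cumsum usage in question. The frame and the column names (key, val, running) are made up for illustration; the running total per key is only meaningful if rows within each group are processed in their original order.

```python
import pandas as pd

# Hypothetical data: two interleaved groups, "a" and "b".
df = pd.DataFrame({
    "key": ["b", "a", "b", "a", "b"],
    "val": [1, 10, 2, 20, 3],
})

# Running total per key; this relies on rows within each group
# being visited in the order they appear in df.
df["running"] = df.groupby("key")["val"].cumsum()
running = df["running"].tolist()
print(running)
```

If in-group order were not preserved, the totals for "b" (1, then 3, then 6) could come out in a different sequence.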

asked Sep 07 '16 15:09

Ami Tavory

2 Answers

Yes; the description of the sort parameter of DataFrame.groupby now promises that groupby (with or without key sorting) "preserves the order of rows within each group":

sort : bool, default True

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
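A small sketch of what this means in practice, using a made-up frame and a hypothetical helper rows_per_group: the sort flag changes the order in which group keys come out, but the rows inside each group keep their original relative order either way.

```python
import pandas as pd

# Hypothetical data; "row" records each row's original position.
df = pd.DataFrame({
    "key": ["b", "a", "b", "a", "b"],
    "row": [0, 1, 2, 3, 4],
})

def rows_per_group(sort):
    """Map each group key to the original row positions, in group order."""
    grouped = df.groupby("key", sort=sort)
    return {key: group["row"].tolist() for key, group in grouped}

sorted_groups = rows_per_group(sort=True)     # keys come out as a, b
unsorted_groups = rows_per_group(sort=False)  # keys come out as b, a
print(sorted_groups)
print(unsorted_groups)
```

In both cases the row positions within a group are increasing, which is the property the documentation describes.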

answered Oct 21 '22 20:10

teichert

Although the docs don't state this, internally it uses a stable sort when generating the groups.

See:

https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L291
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L4356

As I mentioned in the comments, this is important if you consider transform, which returns a Series with its index aligned to the original df. If the sorting didn't preserve the order, alignment would have to do additional work, since the Series would need to be sorted before being assigned. In fact, this is noted in a comment in the pandas source:

_algos.groupsort_indexer implements counting sort and it is at least O(ngroups), where

ngroups = prod(shape)

shape = map(len, keys)

That is, linear in the number of combinations (cartesian product) of unique values of groupby keys. This can be huge when doing multi-key groupby. np.argsort(kind='mergesort') is O(count x log(count)) where count is the length of the data-frame; Both algorithms are stable sort and that is necessary for correctness of groupby operations.

e.g. consider: df.groupby(key)[col].transform('first')
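As a small illustration of the two points in that comment (this sketch is not taken from the pandas source, and the data and names keys, col are made up): a stable argsort keeps equal keys in their original relative order, and transform('first') depends on that to pick the first row of each group as it appears in the frame.

```python
import numpy as np
import pandas as pd

keys = np.array([1, 0, 1, 0, 1])

# Stable sort: positions with equal keys stay in their original order.
order = np.argsort(keys, kind="mergesort")
print(order.tolist())

df = pd.DataFrame({"key": keys, "col": [10, 20, 30, 40, 50]})

# "first" is the first value of col within each group, in original row order,
# broadcast back to every row and aligned with df's index.
first = df.groupby("key")["col"].transform("first")
print(first.tolist())
```

With an unstable sort, the row treated as "first" in a group could be any member of that group, so the result would not be well defined.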

answered Oct 21 '22 20:10

EdChum
