
Is pandas.DataFrame.groupby Guaranteed To Be Stable?

I've noticed several uses of pd.DataFrame.groupby followed by an apply that implicitly assume groupby is stable: that is, if a and b are rows of the same group and a appeared before b prior to grouping, then a will also appear before b after grouping.

I think there are several answers clearly implicitly using this, but, to be concrete, here is one using groupby+cumsum.
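To make the assumption concrete, here is a minimal sketch of the groupby+cumsum pattern. The frame and the column names `key` and `val` are made up for illustration and do not come from the linked answer. The running totals only make sense if the rows of each group are visited in their original order.

```python
import pandas as pd

# Hypothetical data: two interleaved groups, "a" and "b".
df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "val": [1, 10, 2, 20, 3],
})

# Running total per group, returned on the original index.
running = df.groupby("key")["val"].cumsum()
print(running.tolist())  # [1, 10, 3, 30, 6]
```

If groupby reordered rows inside a group, the partial sums for "a" (1, 3, 6) would come out differently.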

Is there anything actually promising this behavior? The documentation only states:

Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.

Also, since pandas has indices, the same functionality could theoretically be achieved without this guarantee (albeit in a more cumbersome way).

Ami Tavory asked Sep 07 '16 15:09



2 Answers

Yes; the description of the sort parameter of DataFrame.groupby now promises that groupby (with or without key sorting) "preserves the order of rows within each group":

sort : bool, default True

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
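A quick way to check this on a small frame is sketched below. The data, the `order` column, and the helper name `rows_per_group` are all invented for this example. It records each row's original position and compares what each group contains with `sort=True` and `sort=False`.

```python
import pandas as pd

# Hypothetical data; "order" stores each row's original position.
df = pd.DataFrame({
    "key": ["b", "a", "b", "a", "b"],
    "order": [0, 1, 2, 3, 4],
})

def rows_per_group(frame, sort):
    """Return the group keys in iteration order and the rows seen in each group."""
    grouped = frame.groupby("key", sort=sort)
    keys = [k for k, _ in grouped]
    within = {k: g["order"].tolist() for k, g in grouped}
    return keys, within

for sort in (True, False):
    print(sort, *rows_per_group(df, sort))
```

Only the order of the group keys changes between the two calls; the rows inside each group keep their original relative order in both.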

teichert answered Oct 21 '22 20:10


Although the docs don't state this, internally it uses a stable sort when generating the groups.

See:

  • https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L291
  • https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L4356

As I mentioned in the comments, this is important if you consider transform, which returns a Series with its index aligned to the original df. If the sorting didn't preserve the order, alignment would have to do additional work, as the Series would need to be sorted prior to assigning. In fact, this is mentioned in the source code comments:

_algos.groupsort_indexer implements counting sort and it is at least O(ngroups), where

ngroups = prod(shape)

shape = map(len, keys)

That is, linear in the number of combinations (cartesian product) of unique values of groupby keys. This can be huge when doing multi-key groupby. np.argsort(kind='mergesort') is O(count x log(count)) where count is the length of the data-frame; Both algorithms are stable sort and that is necessary for correctness of groupby operations.

e.g. consider: df.groupby(key)[col].transform('first')
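The example from that comment can be tried on a toy frame. The sketch below uses made-up data and is not the pandas internal code path. It runs `transform('first')` and then uses `pd.factorize` with NumPy's stable `argsort(kind='mergesort')` to show how a stable sort on group codes keeps rows of the same group in their original order.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two interleaved groups.
df = pd.DataFrame({
    "key": ["x", "y", "x", "y"],
    "col": [10, 20, 30, 40],
})

# Every row receives the first value that appears in its group.
first = df.groupby("key")["col"].transform("first")
print(first.tolist())  # [10, 20, 10, 20]

# Illustration only: integer group codes, then a stable sort on them.
codes, uniques = pd.factorize(df["key"])
stable_order = np.argsort(codes, kind="mergesort")
print(stable_order.tolist())  # [0, 2, 1, 3]
```

With an unstable sort, rows 0 and 2 of group "x" could be swapped, and "first" would then pick 30 instead of 10.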

EdChum answered Oct 21 '22 20:10