
Weird behaviour with groupby on ordered categorical columns

MCVE

import pandas as pd

df = pd.DataFrame({
    'Cat': ['SF', 'W', 'F', 'R64', 'SF', 'F'],
    'ID': [1, 1, 1, 2, 2, 2]
})

df.Cat = pd.Categorical(
    df.Cat, categories=['R64', 'SF', 'F', 'W'], ordered=True)

As you can see, I've defined an ordered categorical column on Cat. To verify, check df.Cat:

0     SF
1      W
2      F
3    R64
4     SF
5      F
Name: Cat, dtype: category
Categories (4, object): [R64 < SF < F < W]

I want to find the largest category PER ID. Doing groupby + max works.

df.groupby('ID').Cat.max()

ID
1    W
2    F
Name: Cat, dtype: object

But I don't want ID to be the index, so I specify as_index=False.

df.groupby('ID', as_index=False).Cat.max()

   ID Cat
0   1   W
1   2  SF

Oops! Now, the max is taken lexicographically. Can anyone explain whether this is intended behaviour? Or is this a bug?

Note, for this problem, the workaround is df.groupby('ID').Cat.max().reset_index().
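For reference, here is a minimal sketch of that workaround next to a second option that sidesteps the aggregation of the categorical column altogether by working on the integer category codes. The names via_reset, codes, top_rows and via_codes are made up for illustration, and the codes-based variant assumes the column has no missing values (a missing value has code -1).

```python
import pandas as pd

df = pd.DataFrame({
    'Cat': ['SF', 'W', 'F', 'R64', 'SF', 'F'],
    'ID': [1, 1, 1, 2, 2, 2],
})
df['Cat'] = pd.Categorical(
    df['Cat'], categories=['R64', 'SF', 'F', 'W'], ordered=True)

# Workaround from the question: aggregate with ID as the index,
# then move ID back into a regular column.
via_reset = df.groupby('ID')['Cat'].max().reset_index()

# Alternative sketch: the integer codes follow the declared category order,
# so the row with the highest code per ID holds the largest category.
codes = df['Cat'].cat.codes
top_rows = codes.groupby(df['ID']).idxmax()
via_codes = df.loc[top_rows.tolist(), ['ID', 'Cat']].reset_index(drop=True)

print(via_reset)
print(via_codes)
```

Both variants leave ID as an ordinary column without passing as_index=False.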

Note,

>>> pd.__version__
'0.22.0'
asked Jun 09 '18 21:06 by cs95



1 Answer

This is not intended behavior; it's a bug.

Source diving shows that the as_index flag sends the aggregation down two completely different code paths. One simply ignores the grouper levels and names and just takes the values with a new range index. The other one clearly keeps them.
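A quick way to see whether a given pandas installation is affected is to run both paths on the data from the question and compare them. This is only a sketch: the names indexed, flat and paths_agree are made up for illustration, and the result is printed, not assumed, because it depends on the installed version.

```python
import pandas as pd

df = pd.DataFrame({
    'Cat': ['SF', 'W', 'F', 'R64', 'SF', 'F'],
    'ID': [1, 1, 1, 2, 2, 2],
})
df['Cat'] = pd.Categorical(
    df['Cat'], categories=['R64', 'SF', 'F', 'W'], ordered=True)

# Path 1: ID ends up in the index of the result.
indexed = df.groupby('ID')['Cat'].max()

# Path 2: ID stays a regular column and the result gets a new range index.
flat = df.groupby('ID', as_index=False)['Cat'].max()

# If both paths respect the category order, they pick the same value per ID.
paths_agree = indexed.astype(str).tolist() == flat['Cat'].astype(str).tolist()
print(pd.__version__, 'paths agree:', paths_agree)
```

If the two disagree, the installed version still has the behaviour described in the question, and the reset_index() route is the safer one.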

answered Oct 12 '22 03:10 by firelynx