MCVE
import pandas as pd

df = pd.DataFrame({
    'Cat': ['SF', 'W', 'F', 'R64', 'SF', 'F'],
    'ID': [1, 1, 1, 2, 2, 2]
})
# Make Cat an ordered categorical: R64 < SF < F < W
df.Cat = pd.Categorical(
    df.Cat, categories=['R64', 'SF', 'F', 'W'], ordered=True)
As you can see, I've defined an ordered categorical column on Cat. To verify, check:
0 SF
1 W
2 F
3 R64
4 SF
5 F
Name: Cat, dtype: category
Categories (4, object): [R64 < SF < F < W]
I want to find the largest category per ID. Doing groupby + max works.
df.groupby('ID').Cat.max()
ID
1 W
2 F
Name: Cat, dtype: object
But I don't want ID to be the index, so I specify as_index=False.
df.groupby('ID', as_index=False).Cat.max()
ID Cat
0 1 W
1 2 SF
Oops! Now, the max is taken lexicographically. Can anyone explain whether this is intended behaviour? Or is this a bug?
Note: for this problem, the workaround is df.groupby('ID').Cat.max().reset_index().
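For reference, here is a minimal, self-contained sketch of that workaround, together with an independent cross-check that goes through the integer category codes. The variable names result, codes and expected are only illustrative, and the cross-check is an assumption about how one might verify the answer, not something pandas requires.

```python
import pandas as pd

df = pd.DataFrame({
    'Cat': ['SF', 'W', 'F', 'R64', 'SF', 'F'],
    'ID': [1, 1, 1, 2, 2, 2],
})
df['Cat'] = pd.Categorical(
    df['Cat'], categories=['R64', 'SF', 'F', 'W'], ordered=True)

# Workaround: aggregate with ID as the index, then move it back to a column.
result = df.groupby('ID')['Cat'].max().reset_index()

# Cross-check: take the max of the integer category codes per ID,
# then map the codes back to their category labels.
codes = df['Cat'].cat.codes.groupby(df['ID']).max()
expected = df['Cat'].cat.categories[codes.values]

print(result)
print(list(expected))
```

Both routes respect the declared category order (R64 < SF < F < W) rather than string order, so they should agree on the per-ID maximum.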
Note, I'm using this pandas version:
>>> pd.__version__
'0.22.0'
groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names.
Groupby preserves the order of rows within each group.
Although Groupby is much faster than Pandas GroupBy.apply and GroupBy.transform with user-defined functions, Pandas is much faster with common functions like mean and sum because they are implemented in Cython. The speed differences are not small.
This is not intended behavior; it's a bug.
Source diving shows that the as_index flag leads to two completely different code paths. One simply ignores the grouper levels and names and just takes the values with a new range index. The other clearly keeps them.