I came across a strange result while playing around with pandas, and I am not sure why it works this way. I am wondering whether it is a bug.
import pandas as pd
cf = pd.DataFrame({'sc': ['b', 'b', 'c', 'd'], 'nn': [1, 2, 3, 4], 'mvl': [10, 20, 30, 40]})
df = cf.groupby('sc').mean()
df.loc['b', 'mvl']
This gives 15.0 as the result.
cf1 = cf.copy()  # copy, so that the original frame cf is left unchanged
cf1['sc'] = cf1['sc'].astype('category', categories=['b', 'c', 'd'], ordered=True)
df1 = cf1.groupby('sc').mean()
df1.loc['b','mvl']
This returns a Series instead:
sc
b 15.0
Name: mvl, dtype: float64
type(df1.loc['b','mvl'])
-> pandas.core.series.Series
type(df.loc['b','mvl'])
-> numpy.float64
Why would declaring the column as categorical change the output of .loc from a scalar to a Series?
I hope it is not a stupid question. Thanks!
This may be a pandas bug. The difference is due to the fact that when you group on a categorical variable, you get a categorical index. You can see it more simply without any groupby:
import pandas
nocat = pandas.Series(['a', 'b', 'c'])
cat = nocat.astype('category', categories=['a', 'b', 'c'], ordered=True)
xno = pandas.Series([8, 88, 888], index=nocat)
xcat = pandas.Series([8, 88, 888], index=cat)
>>> xno.loc['a']
8
>>> xcat.loc['a']
a 8
dtype: int64
The docs note that indexing operations on a CategoricalIndex preserve the categorical index. It appears they even do this if you get only one result, which doesn't exactly contradict the docs but seems like undesirable behavior.
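To check the claim that grouping on a categorical column produces a categorical index, you can compare the index types directly. This is only a sketch: it assumes a newer pandas than the one in the question, where the dtype is spelled with pd.CategoricalDtype and groupby accepts observed. The names cat_type and cf_cat are mine, not from the question.

```python
import pandas as pd

cf = pd.DataFrame({'sc': ['b', 'b', 'c', 'd'],
                   'nn': [1, 2, 3, 4],
                   'mvl': [10, 20, 30, 40]})

# Grouping on a plain string column: the result has an ordinary Index.
plain_index = cf.groupby('sc').mean().index

# Assumption: a newer pandas, where the categorical dtype is built explicitly.
cat_type = pd.CategoricalDtype(categories=['b', 'c', 'd'], ordered=True)
cf_cat = cf.copy()
cf_cat['sc'] = cf_cat['sc'].astype(cat_type)

# observed=True is passed explicitly because its default differs across versions.
cat_index = cf_cat.groupby('sc', observed=True).mean().index

print(type(plain_index).__name__)
print(type(cat_index).__name__)
```

If the second printed type is CategoricalIndex while the first is not, then the only difference between the two lookups is the kind of index that .loc has to work with.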
There is a related pull request that seems to fix this behavior, but it was only recently merged. It looks like the fix should be in pandas 0.18.1.
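If you need code that behaves the same whether or not your installed version has the fix, one option is to unwrap the result yourself. The helper as_scalar below is hypothetical (it is not part of pandas), and the sketch assumes the label matches exactly one row.

```python
import pandas as pd


def as_scalar(value):
    """Hypothetical helper: unwrap a one-element Series, pass scalars through."""
    if isinstance(value, pd.Series):
        if len(value) != 1:
            raise ValueError("expected exactly one matching row")
        return value.iloc[0]
    return value


# Same data as the example above, with an ordered categorical index.
cat = pd.CategoricalIndex(['a', 'b', 'c'],
                          categories=['a', 'b', 'c'], ordered=True)
xcat = pd.Series([8, 88, 888], index=cat)

result = xcat.loc['a']
# Shows which behavior the installed version has (Series or scalar).
print(pd.__version__, type(result).__name__)

value = as_scalar(result)
print(value)
```

The helper gives the same value in both cases, so the rest of your code does not have to care which pandas version it runs on.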