Pandas

Question

I came across a strange result while playing around with Pandas, and I am not sure why it behaves this way. I am wondering whether it is a bug.

import pandas as pd

cf = pd.DataFrame({'sc': ['b', 'b', 'c', 'd'], 'nn': [1, 2, 3, 4], 'mvl': [10, 20, 30, 40]})
df = cf.groupby('sc').mean()
df.loc['b', 'mvl']

This gives 15.0 as the result.

cf1 = cf
cf1['sc'] = cf1['sc'].astype('category', categories=['b', 'c', 'd'], ordered = True)
df1 = cf1.groupby('sc').mean()
df1.loc['b','mvl']

This gives a Series as the result:

sc
b    15.0
Name: mvl, dtype: float64

type(df1.loc['b','mvl']) -> pandas.core.series.Series

type(df.loc['b','mvl']) -> numpy.float64

Why would declaring the variable as categorical change the output of the loc from a scalar to a Series?

I hope it is not a stupid question. Thanks!

BrenBarn · Accepted Answer

This may be a pandas bug. The difference is due to the fact that when you group on a categorical variable, you get a categorical index. You can see it more simply without any groupby:

import pandas

nocat = pandas.Series(['a', 'b', 'c'])
cat = nocat.astype('category', categories=['a', 'b', 'c'], ordered=True)
xno = pandas.Series([8, 88, 888], index=nocat)
xcat = pandas.Series([8, 88, 888], index=cat)

>>> xno.loc['a']
8
>>> xcat.loc['a']
a    8
dtype: int64

The docs note that indexing operations on a CategoricalIndex preserve the categorical index. It appears they even do this if you get only one result, which doesn't exactly contradict the docs but seems like undesirable behavior.
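To confirm that the groupby result really carries a categorical index, you can inspect the index type directly. The sketch below is not from the original answer. It assumes a newer pandas, where the categories are passed through `pd.CategoricalDtype` rather than as `categories=`/`ordered=` keywords to `astype`, and where `groupby` accepts `observed=`. Converting the index back to plain labels is only one possible workaround for getting a scalar out of `.loc`.

```python
import pandas as pd

cf = pd.DataFrame({'sc': ['b', 'b', 'c', 'd'],
                   'nn': [1, 2, 3, 4],
                   'mvl': [10, 20, 30, 40]})

# Assumption: newer pandas, where the dtype object carries categories and order.
cat_type = pd.CategoricalDtype(categories=['b', 'c', 'd'], ordered=True)
cf1 = cf.copy()  # copy so the original frame is left untouched
cf1['sc'] = cf1['sc'].astype(cat_type)

# Grouping on a categorical column yields a CategoricalIndex on the result.
df1 = cf1.groupby('sc', observed=False).mean()
is_categorical_index = isinstance(df1.index, pd.CategoricalIndex)

# Possible workaround: swap the categorical index for plain labels
# before doing label-based lookups.
plain = df1.copy()
plain.index = plain.index.astype(object)
value = plain.loc['b', 'mvl']
```

With a plain index, `plain.loc['b', 'mvl']` behaves like the non-categorical case from the question.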

There is a related pull request that seems to fix this behavior, but it was only recently merged. It looks like the fix should be in pandas 0.18.1.
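If you want to check how your installed version behaves, a small probe like the following may help. It is only a sketch: it builds the categorical index with `pd.CategoricalIndex` directly (an assumption about what is convenient, not the code from the question) and reports whether `.loc` returns a Series or a scalar for a single label.

```python
import pandas as pd

# Build a Series whose index is categorical, mirroring xcat above.
labels = pd.CategoricalIndex(['a', 'b', 'c'],
                             categories=['a', 'b', 'c'],
                             ordered=True)
xcat = pd.Series([8, 88, 888], index=labels)

result = xcat.loc['a']

# True if the lookup produced a scalar, False if it produced a Series.
returns_scalar = not isinstance(result, pd.Series)
print(pd.__version__, type(result), returns_scalar)
```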

Pandas - Category variable and group by - is this a bug?

Tags:

python

Luk17
