
Pandas - Category variable and group by - is this a bug?

Tags: python, pandas

I came across a strange result while playing around with pandas, and I am not sure why it behaves this way. I am wondering whether it is a bug.

import pandas as pd

cf = pd.DataFrame({'sc': ['b', 'b', 'c', 'd'], 'nn': [1, 2, 3, 4], 'mvl': [10, 20, 30, 40]})
df = cf.groupby('sc').mean()
df.loc['b', 'mvl']

This returns 15.0, a scalar.

cf1 = cf
cf1['sc'] = cf1['sc'].astype('category', categories=['b', 'c', 'd'], ordered = True)
df1 = cf1.groupby('sc').mean()
df1.loc['b','mvl']

This returns a Series instead:

sc
b    15.0
Name: mvl, dtype: float64

type(df1.loc['b','mvl']) -> pandas.core.series.Series

type(df.loc['b','mvl']) -> numpy.float64

Why would declaring the variable as categorical change the output of the loc from a scalar to a Series?

I hope it is not a stupid question. Thanks!

Luk17 asked Oct 19 '22 07:10



1 Answer

This may be a pandas bug. The difference is due to the fact that when you group on a categorical variable, you get a categorical index. You can see it more simply without any groupby:

import pandas

nocat = pandas.Series(['a', 'b', 'c'])
cat = nocat.astype('category', categories=['a', 'b', 'c'], ordered=True)
xno = pandas.Series([8, 88, 888], index=nocat)
xcat = pandas.Series([8, 88, 888], index=cat)

>>> xno.loc['a']
8
>>> xcat.loc['a']
a    8
dtype: int64

The docs note that indexing operations on a CategoricalIndex preserve the categorical index. It appears they even do this if you get only one result, which doesn't exactly contradict the docs but seems like undesirable behavior.
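To check the claim that grouping on a categorical column produces a categorical index, one could compare the index types directly. This is only a sketch: it assumes a pandas version that provides `CategoricalDtype` and the `observed` argument of `groupby`, and the names `cat_type`, `cf_cat`, `plain` and `grouped` are made up for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cf = pd.DataFrame({'sc': ['b', 'b', 'c', 'd'],
                   'nn': [1, 2, 3, 4],
                   'mvl': [10, 20, 30, 40]})

# Grouping on the plain string column gives an ordinary Index.
plain = cf.groupby('sc').mean()

# Build an ordered categorical version of the same column
# (assign returns a new frame, so cf itself is left untouched).
cat_type = CategoricalDtype(categories=['b', 'c', 'd'], ordered=True)
cf_cat = cf.assign(sc=cf['sc'].astype(cat_type))

# Grouping on the categorical column gives a CategoricalIndex.
grouped = cf_cat.groupby('sc', observed=False).mean()

print(type(plain.index).__name__)
print(type(grouped.index).__name__)
```

The second printed type should be `CategoricalIndex`, which is the only structural difference between the two grouped frames.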

There is a related pull request that seems to fix this behavior, but it was only recently merged. It looks like the fix should be in pandas 0.18.1.
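If you want to confirm whether your installed pandas still shows the behavior, a quick check along these lines may help. It is a sketch under assumptions: later pandas releases no longer accept the `categories=` and `ordered=` keywords in `astype`, so this version spells the dtype with `CategoricalDtype` instead, and `cat_type` and `result` are illustrative names.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

print(pd.__version__)

# Same data as the small example above, with a categorical index.
cat_type = CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)
cat = pd.Series(['a', 'b', 'c']).astype(cat_type)
xcat = pd.Series([8, 88, 888], index=cat)

# On a fixed version, a single unique label yields a scalar, not a Series.
result = xcat.loc['a']
print(type(result), isinstance(result, pd.Series))
```

If `isinstance(result, pd.Series)` prints `True`, the installed version predates the fix.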

BrenBarn answered Oct 21 '22 03:10
