Pandas difference between `.astype('category')` and `pd.Categorical(...)`

Tags:

python

pandas

I have a dataset with a string column (name: 14) that I want to treat as a categorical feature. As far as I know, there are two ways to do that:

pd.Categorical(data[14])
data[14].astype('category')

While both of these produce a result with the same .dtype, CategoricalDtype(categories=[' <=50K', ' >50K'], ordered=False), they're not the same.

Calling .describe() on the two results produces different outputs. The first one outputs information about the individual categories, while the second one (astype(..)) gives the typical describe output with count, unique, top, freq, and name, listing dtype: object.

My question is, then, why / how do they differ?


It's this dataset: http://archive.ics.uci.edu/ml/datasets/Adult

data = pd.read_csv("./adult/adult.data", header=None)

pd.Categorical(data[14]).describe()
data[14].astype('category').describe()

pd.Categorical(data[14]).dtype
data[14].astype('category').dtype
Petrroll asked Oct 19 '25 05:10


1 Answer

As Bakuriu points out, type(pd.Categorical(data[14])) is Categorical, while type(data[14].astype('category')) is Series:

import pandas as pd
data = pd.read_csv("./adult/adult.data", header=None)

cat = pd.Categorical(data[14])
ser = data[14].astype('category')
print(type(cat))
# <class 'pandas.core.arrays.categorical.Categorical'>
print(type(ser))
# <class 'pandas.core.series.Series'>

The behavior of describe() differs because Categorical.describe is defined differently than Series.describe.
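To see how the two objects relate without downloading the dataset, here is a minimal sketch. The four-row column is made up for illustration and only stands in for column 14 of the Adult data. It suggests that the Series produced by astype('category') is a wrapper around a Categorical, reachable through .array, and that both routes infer the same categories and codes:

```python
import pandas as pd

# Hypothetical stand-in for the income column (name: 14) of the Adult dataset.
values = pd.Series([" <=50K", " >50K", " <=50K", " <=50K"], name=14)

cat = pd.Categorical(values)     # a bare Categorical array
ser = values.astype("category")  # a Series with a categorical dtype

# The Series keeps its data in a Categorical, exposed via .array.
inner = ser.array
print(type(inner).__name__)

# Both routes infer the same categories and the same integer codes.
same_categories = list(cat.categories) == list(inner.categories)
same_codes = list(cat.codes) == list(inner.codes)
print(same_categories, same_codes)
```

So the data are equivalent; what changes is the container, and therefore which describe() method gets called.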

Whenever you call Categorical.describe(), you'll get counts and freqs per category:

In [174]: cat.describe()
Out[174]: 
            counts    freqs
categories                 
 <=50K       24720  0.75919
 >50K         7841  0.24081

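and whenever you call Series.describe() on a categorical Series, you'll get count, unique, top and freq. Note that count and freq mean something different here: count is the number of non-null values in the whole Series, and freq is the number of occurrences of the most common value (top):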

In [175]: ser.describe()
Out[175]: 
count      32561
unique         2
top        <=50K
freq       24720
Name: 14, dtype: object
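If you have the Series but want the per-category table, one possible approach is to build it from value_counts(). This is a sketch on made-up data, not part of the original answer; the column labels counts and freqs are chosen here only to mirror the Categorical.describe() output:

```python
import pandas as pd

# Hypothetical stand-in for data[14].astype('category').
ser = pd.Series([" <=50K", " >50K", " <=50K", " <=50K"], name=14).astype("category")

# Absolute and relative frequency of each category.
counts = ser.value_counts()
freqs = ser.value_counts(normalize=True)
per_category = pd.DataFrame({"counts": counts, "freqs": freqs})
print(per_category)

# Series.describe() summarises the whole column instead:
# total non-null count, number of distinct values, most common value, its count.
summary = ser.describe()
print(summary)
```

On this toy column, freq in the Series summary equals the counts entry of the most common category, which is the relationship between the two outputs shown above.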
unutbu answered Oct 21 '25 19:10


