
Difference between pd.Categorical and pd.api.types.CategoricalDtype

EDITED

Based on the answers so far (thank you), I understand what CategoricalDtype is and what it's used for. But what is the Categorical (the categorical array) for? Does it have a common use case?

--

I don't understand the difference between pd.Categorical and pd.api.types.CategoricalDtype. The latter returns a CategoricalDtype instance and the former returns a Categorical instance. What is a Categorical object? How do they differ? How are they related? When should I use one rather than the other?

type(pd.Categorical(['a','b'],ordered=True))
Out[187]: pandas.core.arrays.categorical.Categorical

type(pd.api.types.CategoricalDtype(['a','b'], ordered=True))
Out[188]: pandas.core.dtypes.dtypes.CategoricalDtype
PKB asked Oct 24 '25 05:10


2 Answers

pd.CategoricalDtype describes a categorical type: its categories and whether they are ordered. You can use it to convert a Series to that type.

For example, suppose you have a Series of strings like this:

s = pd.Series(['a', 'a', 'b', 'b'])

Here, s.dtype returns:

dtype('O')

Now you can create a categorical dtype as follows (pd.api.types.CategoricalDtype is the same class as pd.CategoricalDtype):

s_dtype = pd.api.types.CategoricalDtype(['b','a'], ordered=True)

Then you can use pd.Series.astype to convert the data to that dtype, so that it sorts with b < a:

s.astype(s_dtype).sort_values()

Output:

2    b
3    b
0    a
1    a
dtype: category
Categories (2, object): ['b' < 'a']

Whereas

s = pd.Categorical(['a','b'],ordered=True)

calls the categorical array constructor: it builds the array of values itself, rather than only describing a type.
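
As a rough sketch of how the two objects relate (the variable names are illustrative, not part of the answer above): a Categorical stores integer codes that point into its categories, and its dtype attribute is a CategoricalDtype.

```python
import pandas as pd

# Build a categorical array directly, with explicit categories and order.
arr = pd.Categorical(['a', 'b', 'a'], categories=['b', 'a'], ordered=True)

# The values are stored as integer codes pointing into arr.categories.
print(arr.codes.tolist())  # [1, 0, 1]

# The array's dtype is a CategoricalDtype describing categories and order.
same = arr.dtype == pd.CategoricalDtype(['b', 'a'], ordered=True)
print(same)  # True

# Wrapping the array in a Series keeps that dtype.
wrapped = pd.Series(arr)
print(wrapped.dtype == arr.dtype)  # True
```

So the dtype is the description and the Categorical is the data that follows it. In everyday code you usually meet the array indirectly, for instance as the backing array of a categorical Series.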

Scott Boston answered Oct 27 '25 02:10


To complement @Scott's answer, a CategoricalDtype is very useful when you want several objects to share the same set of categories.

Let's consider for example:

s1 = pd.Series(['a', 'a', 'b', 'b'])
s2 = pd.Series(['a', 'c', 'b', 'b'])

If we convert each Series with the generic 'category' dtype and concatenate, the resulting Series falls back to object, because the two Series end up with different categories:

out1 = pd.concat([s1.astype('category'),
                  s2.astype('category')])

0    a
1    a
2    b
3    b
0    a
1    c
2    b
3    b
dtype: object

Using a common CategoricalDtype instead preserves the categorical dtype after combining the Series:

cat = pd.CategoricalDtype(['a', 'b', 'c'])
out2 = pd.concat([s1.astype(cat),
                  s2.astype(cat)])

0    a
1    a
2    b
3    b
0    a
1    c
2    b
3    b
dtype: category
Categories (3, object): ['a', 'b', 'c']
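
A minimal sketch of why the fallback happens, and of one caveat when sharing a dtype (the names d1, d2, shared and narrow are illustrative): each Series infers its own categories under 'category', and any value missing from an explicit dtype's categories becomes NaN.

```python
import pandas as pd

s1 = pd.Series(['a', 'a', 'b', 'b'])
s2 = pd.Series(['a', 'c', 'b', 'b'])

# Each Series infers its own categories, so the two dtypes differ.
d1 = s1.astype('category').dtype
d2 = s2.astype('category').dtype
print(d1 == d2)  # False

# With a shared dtype, the concatenated result stays categorical.
shared = pd.CategoricalDtype(['a', 'b', 'c'])
both = pd.concat([s1.astype(shared), s2.astype(shared)])
print(isinstance(both.dtype, pd.CategoricalDtype))  # True

# Caveat: values not listed in the dtype's categories become NaN.
narrow = pd.CategoricalDtype(['a', 'b'])
missing = s2.astype(narrow).isna().tolist()
print(missing)  # [False, True, False, False]
```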

Another example, comparing the two Series element-wise with an ordered dtype:

cat = pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
out = s1.astype(cat) < s2.astype(cat)

0    False
1     True
2    False
3    False
dtype: bool
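
For contrast, a short sketch of what ordered=True changes (the variable names are illustrative): with an unordered dtype, only equality comparisons are allowed, and < raises a TypeError.

```python
import pandas as pd

s1 = pd.Series(['a', 'a', 'b', 'b'])
s2 = pd.Series(['a', 'c', 'b', 'b'])

# Ordered categories support <, >, <= and >=.
ordered = pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
result = (s1.astype(ordered) < s2.astype(ordered)).tolist()
print(result)  # [False, True, False, False]

# Unordered categories only support == and !=.
unordered = pd.CategoricalDtype(['a', 'b', 'c'])
raised = False
try:
    s1.astype(unordered) < s2.astype(unordered)
except TypeError:
    raised = True
print(raised)  # True
```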
mozway answered Oct 27 '25 03:10