EDITED
Based on the answers so far (thank you), I understand what the CategoricalDtype is and what it's used for. What is the Categorical / categorical array for? Does it have a common use case?
--
I don't understand the difference between pd.Categorical and pd.api.types.CategoricalDtype. The latter returns a CategoricalDtype instance and the former returns a Categorical instance. What is a Categorical object? How do they differ? How are they related? When should I use one rather than the other?
type(pd.Categorical(['a','b'],ordered=True))
Out[187]: pandas.core.arrays.categorical.Categorical
type(pd.api.types.CategoricalDtype(['a','b'], ordered=True))
Out[188]: pandas.core.dtypes.dtypes.CategoricalDtype
You can use pd.CategoricalDtype to convert a Series to a categorical dtype with the categories (and ordering) you choose.
For example, suppose you have a Series of strings with object dtype like this:
s = pd.Series(['a', 'a', 'b', 'b'])
and s.dtype returns:
dtype('O')
Now, you can create a categorical dtype using the following:
s_dtype = pd.api.types.CategoricalDtype(['b','a'], ordered=True)
Then, you can use pd.Series.astype to convert the data to that dtype, so that sorting follows the order b < a:
s.astype(s_dtype).sort_values()
Output:
2 b
3 b
0 a
1 a
dtype: category
Categories (2, object): ['b' < 'a']
Whereas
s = pd.Categorical(['a','b'],ordered=True)
constructs a categorical array, i.e. the data itself rather than a description of its type.
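To make the relationship between the two concrete, here is a minimal sketch (the variable names are only illustrative). It suggests that a Categorical is the array that stores the values as integer codes plus a list of categories, while the CategoricalDtype is the type description attached to it:

```python
import pandas as pd

# The array: holds the actual values, stored internally as integer codes.
arr = pd.Categorical(['a', 'b', 'a'], categories=['b', 'a'], ordered=True)

codes = arr.codes.tolist()          # positions into arr.categories
categories = list(arr.categories)   # ['b', 'a']

# The dtype: only describes the categories and whether they are ordered.
dtype = arr.dtype

# A Series can wrap the array directly; its dtype is the same CategoricalDtype.
s = pd.Series(arr)
same_dtype = s.dtype == pd.CategoricalDtype(['b', 'a'], ordered=True)
backing_is_categorical = isinstance(s.values, pd.Categorical)
```

In everyday use you would typically reach for CategoricalDtype with astype on an existing Series, and for pd.Categorical when you want to build the categorical data directly.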
To complement @Scott's answer, a CategoricalDtype is very useful when you want to share the same categories across different objects.
Let's consider for example:
s1 = pd.Series(['a', 'a', 'b', 'b'])
s2 = pd.Series(['a', 'c', 'b', 'b'])
If we convert to a generic Categorical and concat, then the resulting Series falls back to object since the categories are not common:
out1 = pd.concat([s1.astype('category'),
s2.astype('category')])
0 a
1 a
2 b
3 b
0 a
1 c
2 b
3 b
dtype: object
Using a common CategoricalDtype instead preserves the dtype after combining the Series:
cat = pd.CategoricalDtype(['a', 'b', 'c'])
out2 = pd.concat([s1.astype(cat),
s2.astype(cat)])
0 a
1 a
2 b
3 b
0 a
1 c
2 b
3 b
dtype: category
Categories (3, object): ['a', 'b', 'c']
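As a quick check of the behaviour described above, the following sketch compares the resulting dtypes programmatically. It also shows pandas.api.types.union_categoricals, which is not part of the original answer, as one possible alternative when the categories are not known up front:

```python
import pandas as pd
from pandas.api.types import union_categoricals

s1 = pd.Series(['a', 'a', 'b', 'b'])
s2 = pd.Series(['a', 'c', 'b', 'b'])

# Independent 'category' conversions infer different categories per Series,
# so concat cannot keep a categorical dtype.
out1 = pd.concat([s1.astype('category'), s2.astype('category')])

# A shared CategoricalDtype gives both Series identical categories.
cat = pd.CategoricalDtype(['a', 'b', 'c'])
out2 = pd.concat([s1.astype(cat), s2.astype(cat)])

# Alternative (an addition, not from the answer): merge the categories
# of already-converted data into a single Categorical.
merged = union_categoricals([s1.astype('category'), s2.astype('category')])
```

Note that union_categoricals returns a Categorical array rather than a Series, so the original index is not carried over.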
Another example: with a shared, ordered CategoricalDtype, the two Series can be compared element-wise:
cat = pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
out = s1.astype(cat) < s2.astype(cat)
0 False
1 True
2 False
3 False
dtype: bool
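For contrast, here is a small sketch of what appears to happen without a shared ordered dtype. It assumes the same s1 and s2 as above; the comparison on independently converted, unordered categoricals is expected to raise a TypeError rather than return a result:

```python
import pandas as pd

s1 = pd.Series(['a', 'a', 'b', 'b'])
s2 = pd.Series(['a', 'c', 'b', 'b'])

# Shared ordered dtype: '<' follows the category order a < b < c.
cat = pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
ordered_result = (s1.astype(cat) < s2.astype(cat)).tolist()

# Independent, unordered categoricals: ordering comparisons are rejected.
try:
    s1.astype('category') < s2.astype('category')
    raised = False
except TypeError:
    raised = True
```

This is one reason to define the CategoricalDtype once and reuse it, rather than converting each Series separately.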