Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

When to use Category rather than Object?

I have a CSV dataset with 40 features that I am handling with Pandas. 7 features are continuous (int32) and the rest of them are categorical.

My question is :

Should I use the dtype('category') of Pandas for the categorical features, or can I let the default dtype('object')?

like image 514
user4640449 Avatar asked Jun 02 '15 16:06

user4640449


People also ask

What is category Dtype?

The category data type in pandas is a hybrid data type. It looks and behaves like a string in many instances but internally is represented by an array of integers. This allows the data to be sorted in a custom order and to more efficiently store the data.

What is DataFrame category?

Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values ( categories ; levels in R).

What does PD categorical do?

Categorical are a Pandas data type. A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory. The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”).

How do you change an object to a category in Python?

astype() method is used to cast a pandas object to a specified dtype. astype() function also provides the capability to convert any suitable existing column to categorical type.


1 Answers

Use a category when there is lots of repetition that you expect to exploit.

For example, suppose I want the aggregate size per exchange for a large table of trades. Using the default object is totally reasonable:

In [6]: %timeit trades.groupby('exch')['size'].sum() 1000 loops, best of 3: 1.25 ms per loop 

But since the list of possible exchanges is pretty small, and because there is lots of repetition, I could make this faster by using a category:

In [7]: trades['exch'] = trades['exch'].astype('category')  In [8]: %timeit trades.groupby('exch')['size'].sum() 1000 loops, best of 3: 702 µs per loop 

Note that categories are really a form of dynamic enumeration. They are most useful if the range of possible values is fixed and finite.

like image 152
chrisaycock Avatar answered Sep 21 '22 11:09

chrisaycock