 

Using Pandas 'categorical' dtype with sklearn

Is there any support in sklearn for using pandas' Categorical datatype directly when fitting models? From what I've seen, sklearn does not support this datatype, which is unfortunate because the Categorical datatype both encodes categorical data and carries the mapping scheme for that data. In addition, categorical encoding is purely a data handling/processing problem, so it seems more natural for it to be handled by pandas.
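
To make the "encoding plus mapping scheme" point concrete, here is a minimal sketch of what a Categorical column holds. The example values are made up for illustration; only the `.cat.categories` and `.cat.codes` accessors come from pandas itself.

```python
import pandas as pd

# Hypothetical example column; the values are made up for illustration.
colors = pd.Series(["red", "green", "blue", "green"], dtype="category")

# The mapping scheme: the distinct categories, stored once
# (inferred categories are sorted when the values are sortable).
categories = list(colors.cat.categories)

# The encoding: one integer code per row, indexing into the categories.
codes = colors.cat.codes.tolist()

# Decoding the integer codes recovers the original labels.
decoded = [categories[c] for c in codes]

print(categories)
print(codes)
print(decoded)
```

Both pieces travel with the column, which is why it would be convenient if an estimator could consume them directly.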

Note

I realize there are several methods to encode categorical variables in Pandas and sklearn - that's not what I'm asking about.

asked Jun 15 '15 18:06 by toes




1 Answer

Cross-posting from the issue-tracker:

I think these are at least two separate questions:

  1. Can / will sklearn support pandas dataframes with categorical features as input?
  2. Can / will sklearn support operating on categorical variables via pandas categorical datatypes?

Taking them in turn:

  1. would be more or less converting all categorical variables into one-hot encoded features, aka dummy columns. That is really easy to do for the user. We could do that "under the hood" in scikit-learn, but it would complicate the code and I don't see a great benefit.

  2. Is basically impossible. Having a categorical datatype would be nice for the trees, but I think pandas has no stable C-level interface, so we can't really tap into that. Even if there were one, it would still require a substantial rewrite of the tree code. I don't think it would be helpful for non-tree estimators.
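
As a rough sketch of what option 1 looks like when the user does it themselves (this is not from the issue-tracker comment; the column names, data, and choice of `DecisionTreeClassifier` are all made up for illustration), the categorical column can be expanded into dummy columns with `pd.get_dummies` before fitting:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; column names and values are assumptions.
df = pd.DataFrame({
    "color": pd.Categorical(["red", "green", "blue", "green"],
                            categories=["blue", "green", "red"]),
    "size": [1.0, 2.0, 3.0, 4.0],
})
y = [0, 1, 0, 1]

# One dummy column per category; the numeric column passes through unchanged.
# dtype=float keeps the whole frame numeric regardless of pandas version.
X = pd.get_dummies(df, columns=["color"], dtype=float)

# Any estimator would do here; a tree is used only as an example.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

print(list(X.columns))
print(clf.predict(X))
```

Note that the fitted model only sees the dummy columns, so the same categories (in the same order) have to be used when encoding new data for prediction.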

answered Sep 22 '22 08:09 by Andreas Mueller