
Dummy variables when not all categories are present

I have a set of dataframes where one of the columns contains a categorical variable. I'd like to convert it to several dummy variables, in which case I'd normally use get_dummies.

get_dummies looks at the data available in each dataframe to find out how many categories there are, and creates the corresponding number of dummy variables. However, in the problem I'm working on right now, I know in advance what the possible categories are, and when looking at each dataframe individually, not all categories necessarily appear.
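A minimal sketch of the mismatch, assuming two hypothetical frames (`df1`, `df2`) that each contain only part of the known category set:

```python
import pandas as pd

# Two hypothetical frames drawn from the same known categories a, b, c.
df1 = pd.DataFrame({'cat': ['a', 'b', 'a']})
df2 = pd.DataFrame({'cat': ['c', 'c', 'b']})

# get_dummies only sees the values present in each frame,
# so the two results end up with different sets of columns.
cols1 = list(pd.get_dummies(df1).columns)
cols2 = list(pd.get_dummies(df2).columns)
print(cols1)
print(cols2)
```

Neither result has all three of cat_a, cat_b and cat_c, so the frames cannot be stacked or fed to the same model without further alignment.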

My question is: is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?

Something that would make this:

    categories = ['a', 'b', 'c']

       cat
    1   a
    2   b
    3   a

Become this:

       cat_a  cat_b  cat_c
    1   1      0      0
    2   0      1      0
    3   1      0      0
Berne asked May 25 '16 00:05




2 Answers

TL;DR:

pd.get_dummies(cat.astype(pd.CategoricalDtype(categories=categories))) 
  • Older pandas: pd.get_dummies(cat.astype('category', categories=categories))

is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?

Yes, there is! pandas has a dedicated categorical dtype for Series. The dtype records the full set of possible categories, and get_dummies takes that set into account even when some categories never occur in the data. Here's an example:

    In [1]: import pandas as pd

    In [2]: possible_categories = list('abc')

    In [3]: cat = pd.Series(list('aba'))

    In [4]: cat = cat.astype(pd.CategoricalDtype(categories=possible_categories))

    In [5]: cat
    Out[5]:
    0    a
    1    b
    2    a
    dtype: category
    Categories (3, object): [a, b, c]

Then, get_dummies will do exactly what you want!

    In [6]: pd.get_dummies(cat)
    Out[6]:
       a  b  c
    0  1  0  0
    1  0  1  0
    2  1  0  0

There are several other ways to create a categorical Series or DataFrame; this is just the one I find most convenient. You can read about all of them in the pandas documentation.
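The same idea can be applied to a DataFrame column to get the cat_-prefixed columns asked for in the question. This is a sketch, assuming a pandas version that has `pd.CategoricalDtype` and the `dtype` argument of get_dummies; the frame `df` is a made-up example:

```python
import pandas as pd

categories = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})

# Declare the full category set on the column, including the absent 'c'.
df['cat'] = df['cat'].astype(pd.CategoricalDtype(categories=categories))

# The column name becomes the prefix; dtype=int gives 0/1 instead of
# the boolean columns that newer pandas versions return by default.
dummies = pd.get_dummies(df, columns=['cat'], dtype=int)
print(dummies)
```

Because the category list lives in the dtype, the same `CategoricalDtype` object can be reused across every dataframe in the set, which keeps the dummy columns identical from one frame to the next.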

EDIT:

I haven't followed the exact versioning, but there was a bug in how pandas treats sparse matrices, at least until version 0.17.0. It was corrected by version 0.18.1 (released May 2016).

For version 0.17.0, if you try to do this with the sparse=True option with a DataFrame, the column of zeros for the missing dummy variable will be a column of NaN, and it will be converted to dense.

It looks like pandas 0.21.0 added CategoricalDtype, and creating categoricals by passing the categories directly to astype, as in the original answer, was deprecated at some point; I'm not quite sure when.

T.C. Proctor answered Sep 20 '22 14:09


Using transpose and reindex

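    import pandas as pd

    cats = ['a', 'b', 'c']
    df = pd.DataFrame({'cat': ['a', 'b', 'a']})

    # Empty prefix and separator so the dummy columns are named 'a', 'b', ...
    # dtype=float reproduces the output below on newer pandas, where
    # get_dummies returns booleans by default.
    dummies = pd.get_dummies(df, prefix='', prefix_sep='', dtype=float)
    # Transpose, reindex the rows (the former columns) to the full category
    # list, transpose back, and fill the missing category with 0.
    dummies = dummies.T.reindex(cats).T.fillna(0)

    print(dummies)

         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  1.0  0.0  0.0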
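A variant of the same reindex idea, sketched here as an alternative rather than as the method above: `DataFrame.reindex` accepts `columns=` and `fill_value=`, so the two transposes and the `fillna` can be dropped. The `dtype=int` argument is an assumption about the desired output type.

```python
import pandas as pd

cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})

# Build dummies for the values that are present ('a' and 'b' only).
dummies = pd.get_dummies(df['cat'], dtype=int)

# Reindex the columns against the full category list; categories that
# are missing from the data get a column filled with 0.
dummies = dummies.reindex(columns=cats, fill_value=0)
print(dummies)
```

Reindexing the columns directly avoids the float conversion that the NaN placeholders cause in the transpose version, so the result stays integer-typed.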
piRSquared answered Sep 17 '22 14:09