Dummy variables when not all categories are present

I have a set of dataframes where one of the columns contains a categorical variable. I'd like to convert it to several dummy variables, in which case I'd normally use get_dummies.

get_dummies looks at the data available in each dataframe to find out how many categories there are, and creates the corresponding number of dummy variables. However, in the problem I'm working on right now, I know in advance what the possible categories are, and when looking at each dataframe individually, not all categories necessarily appear.
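A minimal sketch of that behaviour, using two made-up frames (`first` and `second` are illustrative names, not from the original data):

```python
import pandas as pd

# Two frames drawn from the same set of possible categories: 'a', 'b', 'c'
first = pd.DataFrame({"cat": ["a", "b", "a"]})
second = pd.DataFrame({"cat": ["c", "a"]})

# get_dummies only sees the values present in each frame
first_dummies = pd.get_dummies(first["cat"])
second_dummies = pd.get_dummies(second["cat"])

print(list(first_dummies.columns))
print(list(second_dummies.columns))
```

The two results end up with different column sets, so they can't be stacked or fed to the same model without further alignment.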

My question is: is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?

Something that would make this:

    categories = ['a', 'b', 'c']

       cat
    1   a
    2   b
    3   a

Become this:

       cat_a  cat_b  cat_c
    1      1      0      0
    2      0      1      0
    3      1      0      0

asked May 25 '16 00:05

Berne

2 Answers

TL;DR:

pd.get_dummies(cat.astype(pd.CategoricalDtype(categories=categories)))

Older pandas: pd.get_dummies(cat.astype('category', categories=categories))

is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?

Yes, there is! Pandas has a special type of Series just for categorical data. One of the attributes of this series is the possible categories, which get_dummies takes into account. Here's an example:

    In [1]: import pandas as pd

    In [2]: possible_categories = list('abc')

    In [3]: cat = pd.Series(list('aba'))

    In [4]: cat = cat.astype(pd.CategoricalDtype(categories=possible_categories))

    In [5]: cat
    Out[5]:
    0    a
    1    b
    2    a
    dtype: category
    Categories (3, object): [a, b, c]

Then, get_dummies will do exactly what you want!

    In [6]: pd.get_dummies(cat)
    Out[6]:
       a  b  c
    0  1  0  0
    1  0  1  0
    2  1  0  0

There are a bunch of other ways to create a categorical Series or DataFrame; this is just the one I find most convenient. You can read about all of them in the pandas documentation.
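Applied to the DataFrame from the question, the same idea might look like the sketch below. It assumes a pandas version that has `CategoricalDtype` and the `dtype` argument of `get_dummies`; the frame `df` and its index are reconstructed from the question's example.

```python
import pandas as pd

categories = ["a", "b", "c"]

# The frame from the question: only 'a' and 'b' appear in the data
df = pd.DataFrame({"cat": ["a", "b", "a"]}, index=[1, 2, 3])

# Declare the full set of categories up front
df["cat"] = df["cat"].astype(pd.CategoricalDtype(categories=categories))

# get_dummies now emits one column per declared category, using the
# column name as the prefix; dtype=int requests 0/1 values
dummies = pd.get_dummies(df, columns=["cat"], dtype=int)
print(dummies)
```

This should produce the columns cat_a, cat_b and cat_c, with cat_c all zeros. Note that a value not listed in `categories` becomes NaN when cast, so it would get a row of zeros rather than raising an error.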

EDIT:

I haven't followed the exact versioning, but there was a bug in how pandas treats sparse matrices, at least until version 0.17.0. It was corrected by version 0.18.1 (released May 2016).

For version 0.17.0, if you try to do this with the sparse=True option with a DataFrame, the column of zeros for the missing dummy variable will be a column of NaN, and it will be converted to dense.
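To check how an installed pandas behaves here, something like the following sketch could be used. It assumes a reasonably recent pandas (one that provides `pd.SparseDtype` and the DataFrame `.sparse` accessor); the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "b", "a"]})
df["cat"] = df["cat"].astype(pd.CategoricalDtype(categories=["a", "b", "c"]))

# Ask for sparse dummy columns, including the never-observed category 'c'
sparse_dummies = pd.get_dummies(df, sparse=True)

# Did every column stay sparse?
still_sparse = all(isinstance(t, pd.SparseDtype) for t in sparse_dummies.dtypes)

# Does the dense view contain any NaN?
dense = sparse_dummies.sparse.to_dense()
has_nan = bool(dense.isna().any().any())

print(still_sparse, has_nan)
```

On a version without the bug, the columns should remain sparse and the column for the missing category should hold zeros rather than NaN.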

It looks like pandas 0.21.0 added CategoricalDtype, and creating categoricals by passing the categories directly to astype, as in the original answer, was deprecated. I'm not quite sure when.

answered Sep 20 '22 14:09

T.C. Proctor

Using transpose and reindex

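    import pandas as pd

    cats = ['a', 'b', 'c']
    df = pd.DataFrame({'cat': ['a', 'b', 'a']})

    # One dummy column per category actually present in the data
    dummies = pd.get_dummies(df, prefix='', prefix_sep='')
    # Transpose so the categories form the index, reindex against the full
    # list, transpose back, and fill the newly created column with 0
    dummies = dummies.T.reindex(cats).T.fillna(0)

    print(dummies)

         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  1.0  0.0  0.0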
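A variation on the same idea, not part of the answer above: `DataFrame.reindex` also accepts `columns=` and `fill_value=`, which avoids the double transpose and the float conversion. This is a sketch that assumes a pandas version whose `get_dummies` accepts `dtype`.

```python
import pandas as pd

cats = ["a", "b", "c"]
df = pd.DataFrame({"cat": ["a", "b", "a"]})

# Dummies for the categories present in the data, as 0/1 integers
dummies = pd.get_dummies(df["cat"], dtype=int)

# Align the columns to the full category list; missing ones are filled with 0
dummies = dummies.reindex(columns=cats, fill_value=0)
print(dummies)
```

Unlike the categorical approach, reindexing silently drops any dummy column whose name is not in `cats`, so it is worth checking the data for unexpected values first.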

answered Sep 17 '22 14:09

piRSquared

Tags: python, pandas, machine-learning, dummy-variable
