
Reconstruct a categorical variable from dummies in pandas

Tags: python, pandas

pd.get_dummies converts a categorical variable into dummy variables. Reconstructing the categorical variable from the dummies is trivial, but is there a preferred or quick way to do it?

themiurgo asked Nov 05 '14 16:11




1 Answer

It's been a few years, so this may well not have been in the pandas toolkit back when this question was originally asked, but this approach seems a little easier to me. idxmax returns the index label of the largest element (i.e. the one with a 1). We pass axis=1 because we want, for each row, the column name where the 1 occurs.

EDIT: I didn't bother making it categorical instead of just a string, but you can do that the same way as @Jeff did by wrapping it with pd.Categorical (and pd.Series, if desired).

    In [1]: import pandas as pd

    In [2]: s = pd.Series(['a', 'b', 'a', 'c'])

    In [3]: s
    Out[3]:
    0    a
    1    b
    2    a
    3    c
    dtype: object

    In [4]: dummies = pd.get_dummies(s)

    In [5]: dummies
    Out[5]:
       a  b  c
    0  1  0  0
    1  0  1  0
    2  1  0  0
    3  0  0  1

    In [6]: s2 = dummies.idxmax(axis=1)

    In [7]: s2
    Out[7]:
    0    a
    1    b
    2    a
    3    c
    dtype: object

    In [8]: (s2 == s).all()
    Out[8]: True
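The EDIT above mentions wrapping the result in pd.Categorical. A minimal sketch of that step follows; passing dtype=int to get_dummies and reusing the dummy columns as the category list are my own choices here, not something the session above does.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])

# dtype=int is an assumption made here so the dummies are 0/1 integers
# regardless of the pandas version's default dummy dtype.
dummies = pd.get_dummies(s, dtype=int)

# idxmax(axis=1) gives, per row, the column label holding the maximum (the 1).
labels = dummies.idxmax(axis=1)

# Wrap in pd.Categorical, reusing the dummy columns as the categories so that
# the category list matches the original dummy columns exactly.
s_cat = pd.Series(
    pd.Categorical(labels, categories=dummies.columns),
    index=dummies.index,
)
```

Passing the categories explicitly keeps a category in the result even when no row in the current data happens to use it.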

EDIT in response to @piRSquared's comment: This solution does indeed assume there's exactly one 1 per row. I think this is usually the format one has. pd.get_dummies can return rows that are all 0 if you have drop_first=True or if there are NaN values and dummy_na=False (default) (any cases I'm missing?). Because idxmax returns the first occurrence of the maximum, a row of all zeros will be treated as if it were an instance of the variable named in the first column (e.g. a in the example above).
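One way to guard against the all-zero case is to mask those rows after calling idxmax. This is only a sketch: the helper name idxmax_with_missing is hypothetical, and dtype=int is passed so the row sums behave the same across pandas versions.

```python
import pandas as pd

def idxmax_with_missing(dummies: pd.DataFrame) -> pd.Series:
    """Invert dummies via idxmax, but mark all-zero rows as missing.

    Hypothetical helper for illustration; not part of pandas.
    """
    labels = dummies.idxmax(axis=1)
    has_a_one = dummies.sum(axis=1) > 0
    # Series.where keeps values where the condition holds and puts NaN elsewhere.
    return labels.where(has_a_one)

s = pd.Series(['a', None, 'b'])
dummies = pd.get_dummies(s, dtype=int)   # dummy_na=False, so row 1 is all zeros

naive = dummies.idxmax(axis=1)           # row 1 silently becomes 'a'
checked = idxmax_with_missing(dummies)   # row 1 stays missing
```

The mask only covers rows with no 1 at all; rows with more than one 1 would still resolve to the first matching column.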

If drop_first=True, you have no way to know from the dummies dataframe alone what the name of the "first" variable was, so that operation isn't invertible unless you keep extra information around; I'd recommend leaving drop_first=False (default).
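If drop_first=True is unavoidable, the "extra information" can be as small as the name of the dropped level. The sketch below assumes the data has no NaNs and that get_dummies drops the first category in the same order pd.Categorical reports; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])

# Extra information to keep around: the level that drop_first will remove.
# Assumption: get_dummies orders levels the same way pd.Categorical does.
dropped = pd.Categorical(s).categories[0]

# dtype=int is passed here so the row sums are plain integers.
dummies = pd.get_dummies(s, drop_first=True, dtype=int)   # columns: b, c

labels = dummies.idxmax(axis=1)
# Rows with no 1 anywhere must have belonged to the dropped level.
restored = labels.where(dummies.sum(axis=1) > 0, dropped)
```

Without the saved name, the all-zero rows cannot be labeled from the dummies dataframe alone.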

Since dummy_na=False is the default, this could certainly cause problems. Please set dummy_na=True when you call pd.get_dummies if you want to use this solution to invert the "dummification" and your data contains any NaNs. Setting dummy_na=True will always add a "nan" column, even if that column is all 0s, so you probably don't want to set this unless you actually have NaNs. A nice approach might be to set dummies = pd.get_dummies(series, dummy_na=series.isnull().any()). What's also nice is that the idxmax solution will correctly regenerate your NaNs (not just a string that says "nan").
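A short round trip with a missing value, using the dummy_na=series.isnull().any() pattern suggested above. The bool() wrapper and dtype=int are additions of mine for safety, not part of the original suggestion.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', np.nan, 'b', 'a'])

# Add the NaN indicator column only when the data actually contains NaNs.
# bool(...) converts the NumPy boolean to a plain Python bool.
dummies = pd.get_dummies(s, dummy_na=bool(s.isnull().any()), dtype=int)

# The extra column's label is NaN itself, so idxmax returns a real missing
# value for that row rather than the string "nan".
restored = dummies.idxmax(axis=1)
```

Note that (restored == s).all() would be False here, because NaN never compares equal to NaN; compare with isna() masks or Series.equals instead.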

It's also worth mentioning that setting drop_first=True and dummy_na=False means that NaNs become indistinguishable from an instance of the first variable, so this should be strongly discouraged if your dataset may contain any NaN values.
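The ambiguity is easy to see on a tiny example. This is a sketch with made-up data; dtype=int is passed only so the rows print as 0/1.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', np.nan, 'b'])
dummies = pd.get_dummies(s, drop_first=True, dummy_na=False, dtype=int)

# Only column 'b' remains. Row 0 (originally 'a') and row 1 (originally NaN)
# are both all zeros, so nothing in `dummies` tells them apart.
row_a = dummies.iloc[0].tolist()
row_nan = dummies.iloc[1].tolist()
```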

Nathan answered Sep 20 '22 10:09
