pd.get_dummies allows you to convert a categorical variable into dummy variables. Although reconstructing the categorical variable is trivial, is there a preferred/quick way to do it?
A dummy variable (aka, an indicator variable) is a numeric variable that represents categorical data, such as gender, race, political affiliation, etc.
There are two steps to successfully set up dummy variables in a multiple regression: (1) create dummy variables that represent the categories of your categorical independent variable; and (2) enter values into these dummy variables – known as dummy coding – to represent the categories of the categorical independent ...
To convert your categorical variables to dummy variables in Python you can use the pandas get_dummies() function. For example, if you have the categorical variable “Gender” in your dataframe called “df” you can use the following code to make dummy variables: df_dc = pd.
It's been a few years, so this may well not have been in the pandas toolkit back when this question was originally asked, but this approach seems a little easier to me. idxmax will return the index corresponding to the largest element (i.e. the one with a 1). We do axis=1 because we want the column name where the 1 occurs.
EDIT: I didn't bother making it categorical instead of just a string, but you can do that the same way as @Jeff did by wrapping it with pd.Categorical (and pd.Series, if desired).
In [1]: import pandas as pd

In [2]: s = pd.Series(['a', 'b', 'a', 'c'])

In [3]: s
Out[3]:
0    a
1    b
2    a
3    c
dtype: object

In [4]: dummies = pd.get_dummies(s)

In [5]: dummies
Out[5]:
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1

In [6]: s2 = dummies.idxmax(axis=1)

In [7]: s2
Out[7]:
0    a
1    b
2    a
3    c
dtype: object

In [8]: (s2 == s).all()
Out[8]: True
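For the pd.Categorical wrapping mentioned in the EDIT note, a minimal sketch might look like the following. Passing dtype=int to pd.get_dummies and taking the categories from the dummy columns are my own choices for illustration, not something the original session does.

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c"])

# dtype=int is an assumption here: it keeps the 0/1 layout shown in the session above
dummies = pd.get_dummies(s, dtype=int)

# For each row, idxmax(axis=1) returns the label of the column holding the 1
labels = dummies.idxmax(axis=1)

# Wrap the labels so the result is categorical rather than plain strings;
# the categories are taken from the dummy column names
restored = pd.Series(
    pd.Categorical(labels, categories=dummies.columns),
    index=dummies.index,
)
```

Taking the categories from dummies.columns means a category keeps its slot even if no row ends up using it.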
EDIT in response to @piRSquared's comment: This solution does indeed assume there's one 1 per row. I think this is usually the format one has. pd.get_dummies can return rows that are all 0 if you have drop_first=True or if there are NaN values and dummy_na=False (default) (any cases I'm missing?). A row of all zeros will be treated as if it was an instance of the variable named in the first column (e.g. a in the example above).
If drop_first=True, you have no way to know from the dummies dataframe alone what the name of the "first" variable was, so that operation isn't invertible unless you keep extra information around; I'd recommend leaving drop_first=False (default).
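If you do need drop_first=True, one way to "keep extra information around" is to record the dropped category name yourself. The sketch below is only an illustration under that assumption; first_category and all_zero are hypothetical names, and it assumes the data contains no NaNs.

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c"])

# Hypothetical bookkeeping: remember which category drop_first will discard
# by looking at the first column of the full (undropped) dummies
first_category = pd.get_dummies(s, dtype=int).columns[0]

dummies = pd.get_dummies(s, drop_first=True, dtype=int)

# Rows with no 1 at all must have belonged to the dropped category
all_zero = dummies.sum(axis=1) == 0

# idxmax would label all-zero rows with the first remaining column,
# so overwrite those rows with the remembered category name
restored = dummies.idxmax(axis=1).where(~all_zero, first_category)
```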
Since dummy_na=False is the default, this could certainly cause problems. Please set dummy_na=True when you call pd.get_dummies if you want to use this solution to invert the "dummification" and your data contains any NaNs. Setting dummy_na=True will always add a "nan" column, even if that column is all 0s, so you probably don't want to set this unless you actually have NaNs. A nice approach might be to set dummies = pd.get_dummies(series, dummy_na=series.isnull().any()). What's also nice is that the idxmax solution will correctly regenerate your NaNs (not just a string that says "nan").
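A small round trip with a missing value, as a sketch of the suggestion above. The sample data, the bool() wrapper, and dtype=int are my additions for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", np.nan, "b", "a"])

# Only request the NaN column when the data actually contains NaNs
dummies = pd.get_dummies(s, dummy_na=bool(s.isnull().any()), dtype=int)

# The NaN column's label is a real NaN, so idxmax hands back a real NaN
# for the rows that were missing in the original Series
restored = dummies.idxmax(axis=1)
```

Checking restored.isna() against s.isna() is a quick way to confirm the missing values came back as missing rather than as the string "nan".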
It's also worth mentioning that setting drop_first=True and dummy_na=False means that NaNs become indistinguishable from an instance of the first variable, so this should be strongly discouraged if your dataset may contain any NaN values.
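To see the collision concretely, here is a tiny sketch (sample data and dtype=int are assumptions for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", np.nan, "b"])

# "a" is dropped as the first category and NaN gets no column of its own
dummies = pd.get_dummies(s, drop_first=True, dummy_na=False, dtype=int)

# Row 0 (originally "a") and row 1 (originally NaN) are both all zeros
row_a = dummies.iloc[0].tolist()
row_nan = dummies.iloc[1].tolist()
```

Since the two rows are identical, nothing in the dummies alone can tell them apart.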