<code>pd.get_dummies</code> lets you convert a categorical variable into dummy variables. Although it's trivial to reconstruct the categorical variable by hand, is there a preferred/quick way to do it?

It's been a few years, so this may well not have been in the <code>pandas</code> toolkit when this question was originally asked, but this approach seems a little easier to me. <code>idxmax</code> returns the index label of the largest element (i.e. the one with a <code>1</code>). We pass <code>axis=1</code> because we want the column name where the <code>1</code> occurs.

EDIT: I didn't bother making the result categorical instead of just strings, but you can do that the same way @Jeff did, by wrapping it with <code>pd.Categorical</code> (and <code>pd.Series</code>, if desired).

<pre class="prettyprint"><code>In [1]: import pandas as pd

In [2]: s = pd.Series(['a', 'b', 'a', 'c'])

In [3]: s
Out[3]:
0    a
1    b
2    a
3    c
dtype: object

In [4]: dummies = pd.get_dummies(s)

In [5]: dummies
Out[5]:
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1

In [6]: s2 = dummies.idxmax(axis=1)

In [7]: s2
Out[7]:
0    a
1    b
2    a
3    c
dtype: object

In [8]: (s2 == s).all()
Out[8]: True
</code></pre>

EDIT in response to @piRSquared's comment: This solution does indeed assume there is exactly one <code>1</code> per row. I think this is usually the format one has. <code>pd.get_dummies</code> can return rows that are all 0 if you have <code>drop_first=True</code>, or if there are <code>NaN</code> values and <code>dummy_na=False</code> (the default) (any cases I'm missing?). Because <code>idxmax</code> returns the first label when there is a tie, a row of all zeros will be treated as if it were an instance of the variable named in the first column (e.g. <code>a</code> in the example above).

If <code>drop_first=True</code>, you have no way to know from the dummies dataframe alone what the name of the "first" variable was, so that operation isn't invertible unless you keep extra information around; I'd recommend leaving <code>drop_first=False</code> (the default).

Since <code>dummy_na=False</code> is the default, <code>NaN</code> values could certainly cause problems. Set <code>dummy_na=True</code> when you call <code>pd.get_dummies</code> if you want to use this solution to invert the "dummification" and your data contains any <code>NaN</code>s. Setting <code>dummy_na=True</code> will always add a "nan" column, even if that column is all 0s, so you probably don't want to set it unless you actually have <code>NaN</code>s. A nice approach might be <code>dummies = pd.get_dummies(series, dummy_na=series.isnull().any())</code>. What's also nice is that the <code>idxmax</code> solution will correctly regenerate your <code>NaN</code>s (not just a string that says "nan").

It's also worth mentioning that setting <code>drop_first=True</code> together with <code>dummy_na=False</code> makes <code>NaN</code>s indistinguishable from an instance of the first variable, so this combination should be strongly discouraged if your dataset may contain any <code>NaN</code> values.
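The caveats above can be tried out in a short script. This is only a sketch, not part of the original answer: the sample data is made up, <code>dtype=int</code> is passed so the dummies are 0/1 integers regardless of the pandas version, and <code>idxmax_or_nan</code> is a hypothetical helper (not a pandas function) that leaves all-zero rows as <code>NaN</code> instead of letting them fall back to the first column.

```python
import numpy as np
import pandas as pd

# Case 1: NaN round trip. Add the NaN column only when the data needs it.
s = pd.Series(["a", np.nan, "b", "a"])
dummies = pd.get_dummies(s, dummy_na=bool(s.isnull().any()), dtype=int)
restored = dummies.idxmax(axis=1)  # the NaN row decodes to a real NaN

# Optionally convert the decoded result to a categorical.
restored_cat = pd.Series(pd.Categorical(restored))

# Case 2: drop_first=True silently breaks the inversion.
s2 = pd.Series(["a", "b", "a", "c"])
dropped = pd.get_dummies(s2, drop_first=True, dtype=int)  # only columns b and c
wrong = dropped.idxmax(axis=1)  # all-zero rows fall back to the first column, "b"


def idxmax_or_nan(frame: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: decode with idxmax, but leave all-zero rows as NaN."""
    has_one = frame.sum(axis=1) > 0
    return frame.idxmax(axis=1).where(has_one)


masked = idxmax_or_nan(dropped)  # rows that were "a" become NaN instead of "b"
```

Masking does not recover the dropped category; it only makes the ambiguity visible, so you still need to know the name of the "first" variable from somewhere else.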

Reconstruct a categorical variable from dummies in pandas


answered Sep 20 '22 10:09

Nathan

Tags:

python

pandas

themiurgo
