The most elegant way to get back from pandas.df_dummies

Tags:

python

pandas

From a dataframe with numerical and nominal data:

>>> import pandas as pd
>>> d = {'m': {0: 'M1', 1: 'M2', 2: 'M7', 3: 'M1', 4: 'M2', 5: 'M1'},
         'qj': {0: 'q23', 1: 'q4', 2: 'q9', 3: 'q23', 4: 'q23', 5: 'q9'},
         'Budget': {0: 39, 1: 15, 2: 13, 3: 53, 4: 82, 5: 70}}
>>> df = pd.DataFrame.from_dict(d)
>>> df
   Budget   m   qj
0      39  M1  q23
1      15  M2   q4
2      13  M7   q9
3      53  M1  q23
4      82  M2  q23
5      70  M1   q9

get_dummies converts categorical variables into dummy/indicator variables:

>>> df_dummies = pd.get_dummies(df)
>>> df_dummies
   Budget  m_M1  m_M2  m_M7  qj_q23  qj_q4  qj_q9
0      39     1     0     0       1      0      0
1      15     0     1     0       0      1      0
2      13     0     0     1       0      0      1
3      53     1     0     0       1      0      0
4      82     0     1     0       1      0      0
5      70     1     0     0       0      0      1

What's the most elegant way to write a back_from_dummies function that recovers df from df_dummies?

>>> (back_from_dummies(df_dummies) == df).all()
Budget    True
m         True
qj        True
dtype: bool
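(Aside: pandas 1.5+ ships a built-in pd.from_dummies that reverses get_dummies. A minimal sketch, assuming the dummy columns use "_" as the separator; from_dummies only accepts dummy-coded columns, so the non-dummy Budget column has to be split out first:)

```python
import pandas as pd

d = {'m': {0: 'M1', 1: 'M2', 2: 'M7', 3: 'M1', 4: 'M2', 5: 'M1'},
     'qj': {0: 'q23', 1: 'q4', 2: 'q9', 3: 'q23', 4: 'q23', 5: 'q9'},
     'Budget': {0: 39, 1: 15, 2: 13, 3: 53, 4: 82, 5: 70}}
df = pd.DataFrame.from_dict(d)
df_dummies = pd.get_dummies(df)

# from_dummies expects only dummy-coded columns, so select them by name
dummy_cols = [c for c in df_dummies.columns if "_" in c]
restored = pd.from_dummies(df_dummies[dummy_cols], sep="_")

# Re-attach the numeric column that was never dummied
restored["Budget"] = df_dummies["Budget"]
```

This sidesteps the problem for newer pandas versions; the answers below work on older releases as well.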
asked Dec 30 '15 by user3313834



2 Answers

idxmax will do it pretty easily.

import pandas as pd
from itertools import groupby

def back_from_dummies(df):
    result_series = {}

    # Find dummy columns and build (category, column_name) pairs
    dummy_tuples = [(col.split("_")[0], col) for col in df.columns if "_" in col]

    # Find non-dummy columns (those without a "_")
    non_dummy_cols = [col for col in df.columns if "_" not in col]

    # For each group of category columns, use idxmax to find the hot column.
    # (get_dummies emits columns grouped by prefix, which groupby relies on.)
    for dummy, cols in groupby(dummy_tuples, lambda item: item[0]):

        # Select the columns for this category
        dummy_df = df[[col[1] for col in cols]]

        # idxmax returns the label of the column holding the max (the 1)
        max_columns = dummy_df.idxmax(axis=1)

        # Strip the "category_" prefix to recover the value
        # (split on the first "_" only, in case values contain underscores)
        result_series[dummy] = max_columns.apply(lambda item: item.split("_", 1)[1])

    # Copy non-dummy columns over.
    for col in non_dummy_cols:
        result_series[col] = df[col]

    # Return a dataframe built from the resulting series
    return pd.DataFrame(result_series)

(back_from_dummies(df_dummies) == df).all()
answered by David Maust, Oct 07 '22


First, separate the columns:

In [11]: import numpy as np
         from collections import defaultdict
         pos = defaultdict(list)
         vals = defaultdict(list)

In [12]: for i, c in enumerate(df_dummies.columns):
             if "_" in c:
                 k, v = c.split("_", 1)
                 pos[k].append(i)
                 vals[k].append(v)
             else:
                 pos["_"].append(i)

In [13]: pos
Out[13]: defaultdict(list, {'_': [0], 'm': [1, 2, 3], 'qj': [4, 5, 6]})

In [14]: vals
Out[14]: defaultdict(list, {'m': ['M1', 'M2', 'M7'], 'qj': ['q23', 'q4', 'q9']})

This allows you to slice into the different frames for each dummied column:

In [15]: df_dummies.iloc[:, pos["m"]]
Out[15]:
   m_M1  m_M2  m_M7
0     1     0     0
1     0     1     0
2     0     0     1
3     1     0     0
4     0     1     0
5     1     0     0

Now we can use numpy's argmax:

In [16]: np.argmax(df_dummies.iloc[:, pos["m"]].values, axis=1)
Out[16]: array([0, 1, 2, 0, 1, 0])

Note: pandas idxmax returns the label; we want the position so that we can use Categoricals.
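To see that difference concretely (a small illustrative frame, not from the question):

```python
import numpy as np
import pandas as pd

# A one-hot frame with one 1 per row
sub = pd.DataFrame({"m_M1": [1, 0, 0],
                    "m_M2": [0, 1, 0],
                    "m_M7": [0, 0, 1]})

# idxmax gives the column *labels* of the hot entries
labels = sub.idxmax(axis=1)            # m_M1, m_M2, m_M7

# argmax gives the column *positions*, usable as Categorical codes
codes = np.argmax(sub.values, axis=1)  # 0, 1, 2
```

The positions line up with the order of vals["m"], which is what Categorical.from_codes needs below.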

In [17]: pd.Categorical.from_codes(np.argmax(df_dummies.iloc[:, pos["m"]].values, axis=1), vals["m"])
Out[17]:
[M1, M2, M7, M1, M2, M1]
Categories (3, object): [M1, M2, M7]

Now we can put this all together:

In [21]: df = pd.DataFrame({k: pd.Categorical.from_codes(np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1), vals[k]) for k in vals})

In [22]: df
Out[22]:
    m   qj
0  M1  q23
1  M2   q4
2  M7   q9
3  M1  q23
4  M2  q23
5  M1   q9

and putting back the non-dummied columns:

In [23]: df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]

In [24]: df
Out[24]:
    m   qj  Budget
0  M1  q23      39
1  M2   q4      15
2  M7   q9      13
3  M1  q23      53
4  M2  q23      82
5  M1   q9      70

As a function:

def reverse_dummy(df_dummies):
    pos = defaultdict(list)
    vals = defaultdict(list)

    for i, c in enumerate(df_dummies.columns):
        if "_" in c:
            k, v = c.split("_", 1)
            pos[k].append(i)
            vals[k].append(v)
        else:
            pos["_"].append(i)

    df = pd.DataFrame({k: pd.Categorical.from_codes(
                              np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1),
                              vals[k])
                      for k in vals})

    df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]
    return df

In [31]: reverse_dummy(df_dummies)
Out[31]:
    m   qj  Budget
0  M1  q23      39
1  M2   q4      15
2  M7   q9      13
3  M1  q23      53
4  M2  q23      82
5  M1   q9      70
answered by Andy Hayden, Oct 07 '22