The most elegant way to get back from pandas.df_dummies

Tags:

python

pandas

From a dataframe with numerical and nominal data:

>>> import pandas as pd
>>> d = {'m': {0: 'M1', 1: 'M2', 2: 'M7', 3: 'M1', 4: 'M2', 5: 'M1'},
         'qj': {0: 'q23', 1: 'q4', 2: 'q9', 3: 'q23', 4: 'q23', 5: 'q9'},
         'Budget': {0: 39, 1: 15, 2: 13, 3: 53, 4: 82, 5: 70}}
>>> df = pd.DataFrame.from_dict(d)
>>> df
   Budget   m   qj
0      39  M1  q23
1      15  M2   q4
2      13  M7   q9
3      53  M1  q23
4      82  M2  q23
5      70  M1   q9

get_dummies converts categorical variables into dummy/indicator variables:

>>> df_dummies = pd.get_dummies(df)
>>> df_dummies
   Budget  m_M1  m_M2  m_M7  qj_q23  qj_q4  qj_q9
0      39     1     0     0       1      0      0
1      15     0     1     0       0      1      0
2      13     0     0     1       0      0      1
3      53     1     0     0       1      0      0
4      82     0     1     0       1      0      0
5      70     1     0     0       0      0      1

What's the most elegant way to write a back_from_dummies function that recovers df from df_dummies?

>>> (back_from_dummies(df_dummies) == df).all()
Budget    True
m         True
qj        True
dtype: bool
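(Aside: pandas 1.5+ ships a built-in pd.from_dummies that reverses get_dummies. A minimal sketch, assuming the dummy columns use "_" as the separator; from_dummies only accepts dummy-coded columns, so the non-dummy Budget column has to be split out first:)

```python
import pandas as pd

d = {'m': {0: 'M1', 1: 'M2', 2: 'M7', 3: 'M1', 4: 'M2', 5: 'M1'},
     'qj': {0: 'q23', 1: 'q4', 2: 'q9', 3: 'q23', 4: 'q23', 5: 'q9'},
     'Budget': {0: 39, 1: 15, 2: 13, 3: 53, 4: 82, 5: 70}}
df = pd.DataFrame.from_dict(d)
df_dummies = pd.get_dummies(df)

# from_dummies expects only dummy-coded columns, so select them by name
dummy_cols = [c for c in df_dummies.columns if "_" in c]
restored = pd.from_dummies(df_dummies[dummy_cols], sep="_")

# Re-attach the numeric column that was never dummied
restored["Budget"] = df_dummies["Budget"]
```

This sidesteps the problem for newer pandas versions; the answers below work on older releases as well.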
asked Dec 30 '15 by user3313834



2 Answers

idxmax will do it pretty easily.

import pandas as pd
from itertools import groupby

def back_from_dummies(df):
    result_series = {}

    # Find dummy columns and build (category, column_name) pairs
    dummy_tuples = [(col.split("_")[0], col) for col in df.columns if "_" in col]

    # Find non-dummy columns (those without a "_")
    non_dummy_cols = [col for col in df.columns if "_" not in col]

    # For each group of category columns, use idxmax to find the hot column.
    # (get_dummies emits columns grouped by prefix, which groupby relies on.)
    for dummy, cols in groupby(dummy_tuples, lambda item: item[0]):

        # Select the columns for this category
        dummy_df = df[[col[1] for col in cols]]

        # idxmax returns the label of the column holding the max (the 1)
        max_columns = dummy_df.idxmax(axis=1)

        # Strip the "category_" prefix to recover the value
        # (split on the first "_" only, in case values contain underscores)
        result_series[dummy] = max_columns.apply(lambda item: item.split("_", 1)[1])

    # Copy non-dummy columns over.
    for col in non_dummy_cols:
        result_series[col] = df[col]

    # Return a dataframe built from the resulting series
    return pd.DataFrame(result_series)

(back_from_dummies(df_dummies) == df).all()
answered by David Maust, Oct 07 '22


First, separate the columns:

In [11]: import numpy as np
         from collections import defaultdict
         pos = defaultdict(list)
         vals = defaultdict(list)

In [12]: for i, c in enumerate(df_dummies.columns):
             if "_" in c:
                 k, v = c.split("_", 1)
                 pos[k].append(i)
                 vals[k].append(v)
             else:
                 pos["_"].append(i)

In [13]: pos
Out[13]: defaultdict(list, {'_': [0], 'm': [1, 2, 3], 'qj': [4, 5, 6]})

In [14]: vals
Out[14]: defaultdict(list, {'m': ['M1', 'M2', 'M7'], 'qj': ['q23', 'q4', 'q9']})

This allows you to slice into the different frames for each dummied column:

In [15]: df_dummies.iloc[:, pos["m"]]
Out[15]:
   m_M1  m_M2  m_M7
0     1     0     0
1     0     1     0
2     0     0     1
3     1     0     0
4     0     1     0
5     1     0     0

Now we can use numpy's argmax:

In [16]: np.argmax(df_dummies.iloc[:, pos["m"]].values, axis=1)
Out[16]: array([0, 1, 2, 0, 1, 0])

Note: pandas idxmax returns the label; we want the position so that we can use Categoricals.
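To see that difference concretely (a small illustrative frame, not from the question):

```python
import numpy as np
import pandas as pd

# A one-hot frame with one 1 per row
sub = pd.DataFrame({"m_M1": [1, 0, 0],
                    "m_M2": [0, 1, 0],
                    "m_M7": [0, 0, 1]})

# idxmax gives the column *labels* of the hot entries
labels = sub.idxmax(axis=1)            # m_M1, m_M2, m_M7

# argmax gives the column *positions*, usable as Categorical codes
codes = np.argmax(sub.values, axis=1)  # 0, 1, 2
```

The positions line up with the order of vals["m"], which is what Categorical.from_codes needs below.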

In [17]: pd.Categorical.from_codes(np.argmax(df_dummies.iloc[:, pos["m"]].values, axis=1), vals["m"])
Out[17]:
[M1, M2, M7, M1, M2, M1]
Categories (3, object): [M1, M2, M7]

Now we can put this all together:

In [21]: df = pd.DataFrame({k: pd.Categorical.from_codes(np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1), vals[k]) for k in vals})

In [22]: df
Out[22]:
    m   qj
0  M1  q23
1  M2   q4
2  M7   q9
3  M1  q23
4  M2  q23
5  M1   q9

and putting back the non-dummied columns:

In [23]: df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]

In [24]: df
Out[24]:
    m   qj  Budget
0  M1  q23      39
1  M2   q4      15
2  M7   q9      13
3  M1  q23      53
4  M2  q23      82
5  M1   q9      70

As a function:

def reverse_dummy(df_dummies):
    pos = defaultdict(list)
    vals = defaultdict(list)

    for i, c in enumerate(df_dummies.columns):
        if "_" in c:
            k, v = c.split("_", 1)
            pos[k].append(i)
            vals[k].append(v)
        else:
            pos["_"].append(i)

    df = pd.DataFrame({k: pd.Categorical.from_codes(
                              np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1),
                              vals[k])
                      for k in vals})

    df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]
    return df

In [31]: reverse_dummy(df_dummies)
Out[31]:
    m   qj  Budget
0  M1  q23      39
1  M2   q4      15
2  M7   q9      13
3  M1  q23      53
4  M2  q23      82
5  M1   q9      70
answered by Andy Hayden, Oct 07 '22