From a dataframe with numerical and nominal data:
>>> import pandas as pd
>>> d = {'m': {0: 'M1', 1: 'M2', 2: 'M7', 3: 'M1', 4: 'M2', 5: 'M1'},
...      'qj': {0: 'q23', 1: 'q4', 2: 'q9', 3: 'q23', 4: 'q23', 5: 'q9'},
...      'Budget': {0: 39, 1: 15, 2: 13, 3: 53, 4: 82, 5: 70}}
>>> df = pd.DataFrame.from_dict(d)
>>> df
   Budget   m   qj
0      39  M1  q23
1      15  M2   q4
2      13  M7   q9
3      53  M1  q23
4      82  M2  q23
5      70  M1   q9
get_dummies converts categorical variables into dummy/indicator variables:
>>> df_dummies = pd.get_dummies(df)
>>> df_dummies
   Budget  m_M1  m_M2  m_M7  qj_q23  qj_q4  qj_q9
0      39     1     0     0       1      0      0
1      15     0     1     0       0      1      0
2      13     0     0     1       0      0      1
3      53     1     0     0       1      0      0
4      82     0     1     0       1      0      0
5      70     1     0     0       0      0      1
What's the most elegant way to write a back_from_dummies function that gets back from df_dummies to df?
>>> (back_from_dummies(df_dummies) == df).all()
Budget True
m True
qj True
dtype: bool
The get_dummies() function from the pandas library converts a categorical variable into dummy/indicator variables. It is stateless: the dummy columns are derived from whichever categories appear in the data passed to it.
The drop_first parameter defaults to False, so get_dummies creates one dummy variable for every level of the input categorical variable. If you set drop_first=True, it drops the first category, so a variable with K categories produces only K - 1 dummy variables.
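A minimal sketch of the drop_first behaviour described above, using a small hypothetical frame rather than the one from the question:

```python
import pandas as pd

# Hypothetical data: one categorical column with K = 3 levels.
df = pd.DataFrame({"m": ["M1", "M2", "M7", "M1"]})

full = pd.get_dummies(df)                      # one column per level
reduced = pd.get_dummies(df, drop_first=True)  # first level (M1) dropped

print(list(full.columns))     # ['m_M1', 'm_M2', 'm_M7']
print(list(reduced.columns))  # ['m_M2', 'm_M7']
```

With drop_first=True, a row of all zeros stands for the dropped first level, so any reverse transformation has to know which level was dropped.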
pd.get_dummies is one way to create one-hot encoded features. Its first argument, data, is the DataFrame to which you want to apply one-hot encoding.
idxmax will do it pretty easily.
import pandas as pd
from itertools import groupby

def back_from_dummies(df):
    result_series = {}
    # Find dummy columns and build pairs (category, column_name)
    dummy_tuples = [(col.split("_")[0], col) for col in df.columns if "_" in col]
    # Find non-dummy columns (those that do not have a "_")
    non_dummy_cols = [col for col in df.columns if "_" not in col]
    # For each group of category columns, use idxmax to find the value.
    # groupby only groups consecutive items, which holds for get_dummies output.
    for dummy, cols in groupby(dummy_tuples, lambda item: item[0]):
        # Select the columns for this category
        dummy_df = df[[col[1] for col in cols]]
        # Find the column holding the max value (the 1) in each row
        max_columns = dummy_df.idxmax(axis=1)
        # Remove the "category_" prefix
        result_series[dummy] = max_columns.apply(lambda item: item.split("_", 1)[1])
    # Copy non-dummy columns over.
    for col in non_dummy_cols:
        result_series[col] = df[col]
    # Return a dataframe built from the resulting series
    return pd.DataFrame(result_series)

(back_from_dummies(df_dummies) == df).all()
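To see the core idxmax step in isolation, here is a small sketch on a single group of dummy columns. The frame is hypothetical, and the sketch assumes the default "_" separator with no underscore in the prefix:

```python
import pandas as pd

# Hypothetical one-hot block for a single categorical variable "m".
dummies = pd.DataFrame({
    "m_M1": [1, 0, 0],
    "m_M2": [0, 1, 0],
    "m_M7": [0, 0, 1],
})

# For each row, idxmax(axis=1) returns the label of the column holding the 1.
labels = dummies.idxmax(axis=1)

# Strip the "m_" prefix by splitting once on the separator.
values = labels.str.split("_", n=1).str[1]

print(labels.tolist())  # ['m_M1', 'm_M2', 'm_M7']
print(values.tolist())  # ['M1', 'M2', 'M7']
```

If the original column names themselves contain underscores, splitting on "_" becomes ambiguous; passing a more distinctive prefix_sep to get_dummies is one way around that.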
Firstly, separate the columns:
In [11]: from collections import defaultdict
    ...: pos = defaultdict(list)
    ...: vals = defaultdict(list)

In [12]: for i, c in enumerate(df_dummies.columns):
    ...:     if "_" in c:
    ...:         k, v = c.split("_", 1)
    ...:         pos[k].append(i)
    ...:         vals[k].append(v)
    ...:     else:
    ...:         pos["_"].append(i)
In [13]: pos
Out[13]: defaultdict(list, {'_': [0], 'm': [1, 2, 3], 'qj': [4, 5, 6]})
In [14]: vals
Out[14]: defaultdict(list, {'m': ['M1', 'M2', 'M7'], 'qj': ['q23', 'q4', 'q9']})
This allows you to slice into the different frames for each dummied column:
In [15]: df_dummies.iloc[:, pos["m"]]
Out[15]:
   m_M1  m_M2  m_M7
0     1     0     0
1     0     1     0
2     0     0     1
3     1     0     0
4     0     1     0
5     1     0     0
Now we can use numpy's argmax (assuming numpy has been imported as np):
In [16]: np.argmax(df_dummies.iloc[:, pos["m"]].values, axis=1)
Out[16]: array([0, 1, 2, 0, 1, 0])
*Note: pandas idxmax returns the label; we want the position so that we can use Categoricals.*
In [17]: pd.Categorical.from_codes(np.argmax(df_dummies.iloc[:, pos["m"]].values, axis=1), vals["m"])
Out[17]:
[M1, M2, M7, M1, M2, M1]
Categories (3, object): [M1, M2, M7]
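The label-versus-position distinction in the note above can be sketched on a small hypothetical block in which one level (M7) never occurs. The frame and variable names here are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical one-hot block; no row has M7 set.
dummies = pd.DataFrame({
    "m_M1": [1, 1, 0],
    "m_M2": [0, 0, 1],
    "m_M7": [0, 0, 0],
})

# argmax gives integer positions, which is what from_codes expects.
codes = np.argmax(dummies.to_numpy(), axis=1)

# The full category list is passed explicitly, so the unobserved
# level M7 is kept as a category.
cat = pd.Categorical.from_codes(codes, categories=["M1", "M2", "M7"])

print(codes.tolist())         # [0, 0, 1]
print(list(cat))              # ['M1', 'M1', 'M2']
print(list(cat.categories))   # ['M1', 'M2', 'M7']
```

Building a Categorical from codes keeps the complete set of levels, which a plain column of strings would not record.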
Now we can put this all together:
In [21]: df = pd.DataFrame({k: pd.Categorical.from_codes(np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1), vals[k]) for k in vals})
In [22]: df
Out[22]:
    m   qj
0  M1  q23
1  M2   q4
2  M7   q9
3  M1  q23
4  M2  q23
5  M1   q9
and putting back the non-dummied columns:
In [23]: df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]
In [24]: df
Out[24]:
    m   qj  Budget
0  M1  q23      39
1  M2   q4      15
2  M7   q9      13
3  M1  q23      53
4  M2  q23      82
5  M1   q9      70
As a function:
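from collections import defaultdict

import numpy as np
import pandas as pd

def reverse_dummy(df_dummies):
    pos = defaultdict(list)   # prefix -> positions of its dummy columns
    vals = defaultdict(list)  # prefix -> category values, in column order
    for i, c in enumerate(df_dummies.columns):
        if "_" in c:
            k, v = c.split("_", 1)
            pos[k].append(i)
            vals[k].append(v)
        else:
            # Non-dummy columns are collected under the "_" key
            pos["_"].append(i)

    # Rebuild each categorical column from the argmax position in its block
    df = pd.DataFrame({k: pd.Categorical.from_codes(
                              np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1),
                              vals[k])
                       for k in vals})

    # Put back the non-dummied columns
    df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]
    return df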
In [31]: reverse_dummy(df_dummies)
Out[31]:
    m   qj  Budget
0  M1  q23      39
1  M2   q4      15
2  M7   q9      13
3  M1  q23      53
4  M2  q23      82
5  M1   q9      70
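As an alternative to the hand-written helpers above, newer pandas releases ship a built-in inverse. The sketch below assumes pandas 1.5 or later, where pd.from_dummies is available. It accepts only dummy columns, so the numeric Budget column is set aside and added back afterwards. The "_" test used to select dummy columns is an assumption carried over from the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    "m": ["M1", "M2", "M7", "M1", "M2", "M1"],
    "qj": ["q23", "q4", "q9", "q23", "q23", "q9"],
    "Budget": [39, 15, 13, 53, 82, 70],
})
df_dummies = pd.get_dummies(df)

# Assumption: dummy columns are exactly those containing the separator.
dummy_cols = [c for c in df_dummies.columns if "_" in c]

# from_dummies splits each column name on sep into (prefix, category).
restored = pd.from_dummies(df_dummies[dummy_cols], sep="_")

# Put back the non-dummied column.
restored["Budget"] = df_dummies["Budget"]

print(restored)
```

pd.from_dummies raises an error if a row has more than one category set within a prefix, which the argmax-based approaches would silently resolve to the first match.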