
How to get pandas get_dummies to emit N-1 variables to avoid collinearity?

pandas.get_dummies emits a dummy variable per categorical value. Is there some automated, easy way to ask it to create only N-1 dummy variables (just getting rid of one "baseline" variable arbitrarily)?

This is needed to avoid collinearity in our dataset.

ihadanny asked Jul 19 '15 05:07


People also ask

What does drop_first do in get_dummies?

The drop_first parameter of get_dummies controls whether the reference level is kept or dropped, i.e. whether k or k-1 dummies are created from k categorical levels.

What does get_dummies do in pandas?

get_dummies() is used for data manipulation. It converts categorical data into dummy or indicator variables.

How can we handle dummy variable trap?

To avoid the dummy variable trap, drop one of the columns created when the categorical variable was converted to dummy variables by one-hot encoding. This works because the dummy columns carry redundant information: any one of them can be derived from the others.
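The redundancy can be checked numerically. The following is a minimal sketch, assuming a design matrix that includes an intercept column (the variable names are illustrative): with all k dummies the matrix is rank-deficient, and with k-1 dummies it has full column rank.

```python
import numpy as np
import pandas as pd

s = pd.Series(list("abcbacb"))

# All k dummies versus k-1 dummies (first level dropped).
full = pd.get_dummies(s).to_numpy(dtype=float)
reduced = pd.get_dummies(s, drop_first=True).to_numpy(dtype=float)

# Prepend an intercept column of ones, as a linear model usually would.
intercept = np.ones((len(s), 1))
X_full = np.hstack([intercept, full])
X_reduced = np.hstack([intercept, reduced])

# The k dummy columns sum to the intercept column, so X_full loses one rank.
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # number of columns vs. rank
print(X_reduced.shape[1], rank_reduced)  # number of columns vs. rank
```

Here the full matrix has more columns than its rank, while the reduced one does not, which is the collinearity the question is trying to avoid.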

When should you not use get_dummies?

get_dummies() returns a pandas DataFrame with clean column names after encoding, but it is often not recommended for production pipelines or Kaggle competitions. A fitted encoder (for example a one-hot encoder or a count vectorizer) is usually preferred, because it retains the categories learned from the training data and applies the same encoding to new data.


2 Answers

Pandas version 0.18.0 implemented exactly what you're looking for: the drop_first option. Here's an example:

    In [1]: import pandas as pd

    In [2]: pd.__version__
    Out[2]: u'0.18.1'

    In [3]: s = pd.Series(list('abcbacb'))

    In [4]: pd.get_dummies(s, drop_first=True)
    Out[4]:
         b    c
    0  0.0  0.0
    1  1.0  0.0
    2  0.0  1.0
    3  1.0  0.0
    4  0.0  0.0
    5  0.0  1.0
    6  1.0  0.0
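The same option also works when encoding selected columns of a DataFrame. Below is a minimal sketch with made-up data (the `color` and `size` columns are hypothetical). Note that the dtype of the dummy columns differs between pandas versions, so the values are not printed here.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1, 2, 3, 4],
})

# drop_first=True drops the first level of each encoded column.
# Levels are sorted, so "blue" becomes the baseline here.
encoded = pd.get_dummies(df, columns=["color"], drop_first=True)

print(list(encoded.columns))  # ['size', 'color_green', 'color_red']
```

Rows where both `color_green` and `color_red` are zero correspond to the dropped baseline level.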
T.C. Proctor answered Oct 13 '22 20:10


There are a number of ways of doing so.

Possibly the simplest is replacing one of the values by None before calling get_dummies. Say you have:

    import pandas as pd
    import numpy as np

    s = pd.Series(list('babca'))

    >> s
    0    b
    1    a
    2    b
    3    c
    4    a

Then use:

    >> pd.get_dummies(np.where(s == s.unique()[0], None, s))
        a   c
    0   0   0
    1   1   0
    2   0   0
    3   0   1
    4   1   0

to drop b.

(Of course, you need to check whether your category column already contains None, since those rows would become indistinguishable from the dropped baseline.)
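A variant of this idea that includes the missing-value check might look like the sketch below. It swaps `np.where` for `Series.where`, which is a substitution of mine rather than the approach shown above, and it relies on get_dummies ignoring NaN when `dummy_na` is left at its default of False.

```python
import pandas as pd

s = pd.Series(list("babca"))

# Refuse to continue if the column already has missing values, because
# they could not be told apart from the masked baseline afterwards.
if s.isna().any():
    raise ValueError("column already contains missing values")

baseline = s.unique()[0]  # 'b', the first value encountered

# Series.where keeps values where the condition holds and puts NaN elsewhere.
masked = s.where(s != baseline)

# With dummy_na left at False, NaN rows get zeros in every dummy column.
dummies = pd.get_dummies(masked)

print(list(dummies.columns))  # ['a', 'c']
```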


Another way is to use the prefix argument to get_dummies:

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False)

prefix: string, list of strings, or dict of strings, default None. String to append to DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.

This adds the prefix to all of the resulting columns, and you can then drop one of the columns carrying this prefix (just make sure the prefix is unique).
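The prefix approach described above can be sketched as follows, assuming a hypothetical `city` column. Unlike dropping an arbitrary level, this lets you pick the baseline explicitly.

```python
import pandas as pd

# Hypothetical toy data; "city" is an illustrative column name.
df = pd.DataFrame({"city": ["rome", "oslo", "lima", "oslo"]})

# Every dummy column gets the "city" prefix, joined with the default "_".
dummies = pd.get_dummies(df["city"], prefix="city")

# Choose the baseline level explicitly and drop its column.
baseline = "city_oslo"
reduced = dummies.drop(columns=[baseline])

print(list(reduced.columns))  # ['city_lima', 'city_rome']
```

Rows for the baseline city end up with zeros in all remaining dummy columns.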

Ami Tavory answered Oct 13 '22 19:10