 

What's the difference between dummy variable and one-hot encoding?

I'm making features for a machine learning model and I'm confused about dummy variables vs. one-hot encoding. For instance, take a categorical variable 'week' with values 1-7. With one-hot encoding, week = 1 becomes [1,0,0,0,0,0,0], week = 2 becomes [0,1,0,0,0,0,0], and so on. But I could also make a dummy variable 'week_v': here I must pick one value as the hidden base level, so week_v = 1 is [1,0,0,0,0,0], week_v = 2 is [0,1,0,0,0,0], ... and week_v = 7 never gets its own column. So what's the difference between them? I'm using a logistic model and will try GBDT next.

asked Dec 14 '16 by Peng He

People also ask

What is dummy variable trap one-hot encoding?

What is the Dummy Variable Trap? The Dummy Variable Trap occurs when two or more dummy variables created by one-hot encoding are highly correlated (multicollinear). This means that one variable can be predicted from the others, making it difficult to interpret the estimated coefficients in regression models.

Is get Dummies one-hot encoding?

The pandas function pd.get_dummies() allows you to easily one-hot encode your categorical data.

What is the disadvantage of one-hot encoding?

Another disadvantage of one-hot encoding is that it produces multicollinearity among the various variables, lowering the model's accuracy. In addition, you may wish to transform the values back to categorical form so that they may be displayed in your application.

What is a hot encoding?

One Hot Encoding is a common way of preprocessing categorical features for machine learning models. This type of encoding creates a new binary feature for each possible category and assigns a value of 1 to the feature of each sample that corresponds to its original category.



2 Answers

In fact, there is no difference in the effect of the two approaches (they are largely two names for the same idea) on your regression.

In either case, you have to make sure that one of your dummies is left out (i.e. serves as base assumption) to avoid perfect multicollinearity among the set.

For instance, if you want to take the weekday of an observation into account, you only use 6 (not 7) dummies assuming the one left out to be the base variable. When using one-hot encoding, your weekday variable is present as a categorical value in one single column, effectively having the regression use the first of its values as the base.
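For a concrete sketch (using pandas, with a hypothetical `week` column), both encodings come from the same `pd.get_dummies` call; `drop_first=True` is what turns full one-hot encoding into dummy coding with an implicit base level:

```python
import pandas as pd

# Hypothetical data: 'week' takes values 1-7 (only some observed here).
df = pd.DataFrame({"week": [1, 2, 7, 3]})

# Full one-hot encoding: one column per observed category.
one_hot = pd.get_dummies(df["week"], prefix="week")

# Dummy coding: drop the first level, so week_1 becomes the base category.
dummies = pd.get_dummies(df["week"], prefix="week", drop_first=True)

print(list(one_hot.columns))   # one column per observed value of 'week'
print(list(dummies.columns))   # week_1 is gone; it is the implicit base
```

Note that `get_dummies` only creates columns for categories that actually appear in the data; to guarantee all 7 columns you would first cast the column to `pd.Categorical` with the full list of categories.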

answered Sep 24 '22 by jbndlr

Technically, six dummies are enough to provide a unique mapping for a vocabulary of size 7:

 1. Sunday    [0,0,0,0,0,0]
 2. Monday    [1,0,0,0,0,0]
 3. Tuesday   [0,1,0,0,0,0]
 4. Wednesday [0,0,1,0,0,0]
 5. Thursday  [0,0,0,1,0,0]
 6. Friday    [0,0,0,0,1,0]
 7. Saturday  [0,0,0,0,0,1]
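The mapping above can be sketched in plain Python (a minimal illustration; the day ordering and the choice of Sunday as the base are taken from the table):

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]

def dummy_code(day):
    """Return a 6-element 0/1 vector; the base day (Sunday) is all zeros."""
    vec = [0] * 6
    idx = DAYS.index(day)
    if idx > 0:
        vec[idx - 1] = 1
    return vec

print(dummy_code("Sunday"))   # base category: all zeros
print(dummy_code("Monday"))   # first dummy set to 1
```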

Dummy coding is the more compact representation, and it is preferred in statistical models that perform better when the inputs are linearly independent.

Modern machine learning algorithms, though, don't require their inputs to be linearly independent; they use methods such as L1 regularization to prune redundant inputs. The additional degree of freedom also allows the framework to transparently handle a missing input in production as the all-zeros vector.

 1. Sunday    [0,0,0,0,0,0,1]
 2. Monday    [0,0,0,0,0,1,0]
 3. Tuesday   [0,0,0,0,1,0,0]
 4. Wednesday [0,0,0,1,0,0,0]
 5. Thursday  [0,0,1,0,0,0,0]
 6. Friday    [0,1,0,0,0,0,0]
 7. Saturday  [1,0,0,0,0,0,0]

    missing value: [0,0,0,0,0,0,0]
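A minimal sketch of this 7-column scheme with the all-zeros fallback (the reversed index order matches the table above, where Sunday maps to the last slot):

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]

def one_hot(day):
    """7-element one-hot vector; unknown or missing values map to all zeros."""
    vec = [0] * 7
    if day in DAYS:
        # Reversed index to match the table: Sunday -> last position.
        vec[6 - DAYS.index(day)] = 1
    return vec

print(one_hot("Sunday"))     # [0,0,0,0,0,0,1]
print(one_hot("not a day"))  # all zeros: the missing-value case
```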
answered Sep 24 '22 by Ravi