I'm making features for a machine learning model, and I'm confused about dummy variables vs. one-hot encoding. For instance, take a categorical variable 'week' that ranges over 1-7. With one-hot encoding, week = 1 is encoded as 1000000, week = 2 as 0100000, and so on. But I can also make a dummy variable 'week_v'; in that case I must set a hidden (base) variable, so week_v = 1 is 100000, week_v = 2 is 010000, ..., and week_v = 7 does not appear as a column at all. So what's the difference between them? I'm using a logistic model, and then I'll try GBDT.
What is the dummy variable trap? The dummy variable trap occurs when the dummy variables created by one-hot encoding are perfectly correlated (multicollinear): any one of them can be predicted from the others, which makes the estimated coefficients of a regression model unidentifiable and hard to interpret.
The pandas function pd.get_dummies() lets you easily one-hot encode your categorical data.
This multicollinearity among the indicator columns is the main disadvantage of full one-hot encoding: it destabilizes coefficient estimates in linear models. In addition, you may wish to transform the values back to categorical form so that they can be displayed in your application.
One-hot encoding is a common way of preprocessing categorical features for machine learning models. It creates a new binary feature for each possible category and assigns a value of 1 to the feature that corresponds to each sample's original category.
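For example, here is a minimal pandas sketch (the toy 'week' column is made up for illustration) contrasting the two encodings; drop_first=True is what turns one-hot output into dummy coding:

```python
import pandas as pd

df = pd.DataFrame({"week": [1, 2, 3, 4, 5, 6, 7, 1]})

# One-hot encoding: one indicator column per category -> 7 columns.
onehot = pd.get_dummies(df["week"], prefix="week")

# Dummy coding: drop_first=True omits the first level as the base -> 6 columns.
dummies = pd.get_dummies(df["week"], prefix="week", drop_first=True)

print(onehot.columns.tolist())   # ['week_1', 'week_2', ..., 'week_7']
print(dummies.columns.tolist())  # ['week_2', 'week_3', ..., 'week_7']
```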
In fact, there is no difference in the effect of the two approaches on your regression; they are really just two wordings for the same idea.
In either case, you have to make sure that one of your dummies is left out (i.e. serves as the base category) to avoid perfect multicollinearity among the set.
For instance, if you want to take the weekday of an observation into account, you use only 6 (not 7) dummies, with the omitted one serving as the base. When you instead hand the regression your weekday variable as a categorical value in one single column, the formula machinery effectively uses the first of its levels as the base.
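As a sketch of that second route, here is a hedged example with statsmodels (the random data and column names are hypothetical, chosen only to show the design matrix):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "weekday": rng.integers(1, 8, size=500),  # categories 1..7
    "y": rng.integers(0, 2, size=500),        # binary outcome
})

# C(weekday) tells the formula machinery to treat the column as
# categorical; it automatically drops one level (weekday=1 here)
# as the base, so only 6 dummies enter the design matrix.
model = smf.logit("y ~ C(weekday)", data=df).fit(disp=0)
print(model.params)  # intercept + 6 coefficients, one per non-base level
```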
Technically, a 6-dimensional encoding is enough to provide a unique mapping for a vocabulary of size 7:
1. Sunday [0,0,0,0,0,0]
2. Monday [1,0,0,0,0,0]
3. Tuesday [0,1,0,0,0,0]
4. Wednesday [0,0,1,0,0,0]
5. Thursday [0,0,0,1,0,0]
6. Friday [0,0,0,0,1,0]
7. Saturday [0,0,0,0,0,1]
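A tiny NumPy sketch that reproduces this mapping (the choice of Sunday as the all-zeros base is just the convention of the table above):

```python
import numpy as np

days = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]

# The base category (Sunday) is all zeros; each remaining day
# gets exactly one indicator column, reproducing the table above.
codes = np.vstack([np.zeros((1, 6)), np.eye(6)]).astype(int)
for day, code in zip(days, codes):
    print(f"{day:<9} {code}")
```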
Dummy coding is a more compact representation, and it is preferred in statistical models that perform better when the inputs are linearly independent.
Modern machine learning algorithms, though, don't require their inputs to be linearly independent and use methods such as L1 regularization to prune redundant inputs. The additional degree of freedom in full one-hot encoding allows the framework to transparently handle a missing input in production as all zeros:
1. Sunday [0,0,0,0,0,0,1]
2. Monday [0,0,0,0,0,1,0]
3. Tuesday [0,0,0,0,1,0,0]
4. Wednesday [0,0,0,1,0,0,0]
5. Thursday [0,0,1,0,0,0,0]
6. Friday [0,1,0,0,0,0,0]
7. Saturday [1,0,0,0,0,0,0]
For missing values: [0,0,0,0,0,0,0]
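scikit-learn's OneHotEncoder supports exactly this behavior via handle_unknown="ignore". A sketch (assuming scikit-learn ≥ 1.2, where the keyword is sparse_output; older releases call it sparse):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

days = ["Saturday", "Friday", "Thursday", "Wednesday",
        "Tuesday", "Monday", "Sunday"]  # column order of the table above

# Full one-hot: one column per category, no base dropped.
# handle_unknown="ignore" maps categories unseen at fit time
# to the all-zeros row shown above.
enc = OneHotEncoder(categories=[days], handle_unknown="ignore",
                    sparse_output=False)
enc.fit(np.array(days).reshape(-1, 1))

print(enc.transform([["Sunday"]]))   # [[0. 0. 0. 0. 0. 0. 1.]]
print(enc.transform([["Holiday"]]))  # [[0. 0. 0. 0. 0. 0. 0.]]
```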