I'm making features for a machine learning model, and I'm confused about dummy variables vs. one-hot encoding. For instance, take a categorical variable 'week' that ranges over 1-7. With one-hot encoding, week = 1 is encoded as 1000000, week = 2 as 0100000, and so on. But I can also make a dummy variable 'week_v'; in that case I must set a hidden (base) variable, so week_v = 1 is 100000, week_v = 2 is 010000, ..., and week_v = 7 does not appear as a column at all. So what's the difference between them? I'm using a logistic model, and then I'll try GBDT.
What is the dummy variable trap? The dummy variable trap occurs when the dummy variables created by one-hot encoding are perfectly correlated (multicollinear): any one of them can be predicted from the others, which makes the estimated coefficients of a regression model unidentifiable and hard to interpret.
The pandas function pd.get_dummies() lets you easily one-hot encode your categorical data.
This multicollinearity among the indicator columns is the main disadvantage of full one-hot encoding: it destabilizes coefficient estimates in linear models. In addition, you may wish to transform the values back to categorical form so that they can be displayed in your application.
One-hot encoding is a common way of preprocessing categorical features for machine learning models. It creates a new binary feature for each possible category and assigns a value of 1 to the feature that corresponds to each sample's original category.
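For example, here is a minimal pandas sketch (the toy 'week' column is made up for illustration) contrasting the two encodings; drop_first=True is what turns one-hot output into dummy coding:

```python
import pandas as pd

df = pd.DataFrame({"week": [1, 2, 3, 4, 5, 6, 7, 1]})

# One-hot encoding: one indicator column per category -> 7 columns.
onehot = pd.get_dummies(df["week"], prefix="week")

# Dummy coding: drop_first=True omits the first level as the base -> 6 columns.
dummies = pd.get_dummies(df["week"], prefix="week", drop_first=True)

print(onehot.columns.tolist())   # ['week_1', 'week_2', ..., 'week_7']
print(dummies.columns.tolist())  # ['week_2', 'week_3', ..., 'week_7']
```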
In fact, there is no difference in the effect of the two approaches on your regression; they are really just two wordings for the same idea.
In either case, you have to make sure that one of your dummies is left out (i.e. serves as the base category) to avoid perfect multicollinearity among the set.
For instance, if you want to take the weekday of an observation into account, you use only 6 (not 7) dummies, with the omitted one serving as the base. When you instead hand the regression your weekday variable as a categorical value in one single column, the formula machinery effectively uses the first of its levels as the base.
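As a sketch of that second route, here is a hedged example with statsmodels (the random data and column names are hypothetical, chosen only to show the design matrix):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "weekday": rng.integers(1, 8, size=500),  # categories 1..7
    "y": rng.integers(0, 2, size=500),        # binary outcome
})

# C(weekday) tells the formula machinery to treat the column as
# categorical; it automatically drops one level (weekday=1 here)
# as the base, so only 6 dummies enter the design matrix.
model = smf.logit("y ~ C(weekday)", data=df).fit(disp=0)
print(model.params)  # intercept + 6 coefficients, one per non-base level
```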
Technically, a 6-dimensional encoding is enough to provide a unique mapping for a vocabulary of size 7:
1. Sunday [0,0,0,0,0,0]
2. Monday [1,0,0,0,0,0]
3. Tuesday [0,1,0,0,0,0]
4. Wednesday [0,0,1,0,0,0]
5. Thursday [0,0,0,1,0,0]
6. Friday [0,0,0,0,1,0]
7. Saturday [0,0,0,0,0,1]
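A tiny NumPy sketch that reproduces this mapping (the choice of Sunday as the all-zeros base is just the convention of the table above):

```python
import numpy as np

days = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]

# The base category (Sunday) is all zeros; each remaining day
# gets exactly one indicator column, reproducing the table above.
codes = np.vstack([np.zeros((1, 6)), np.eye(6)]).astype(int)
for day, code in zip(days, codes):
    print(f"{day:<9} {code}")
```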
Dummy coding is a more compact representation, and it is preferred in statistical models that perform better when the inputs are linearly independent.
Modern machine learning algorithms, though, don't require their inputs to be linearly independent and use methods such as L1 regularization to prune redundant inputs. The additional degree of freedom in full one-hot encoding allows the framework to transparently handle a missing input in production as all zeros:
1. Sunday [0,0,0,0,0,0,1]
2. Monday [0,0,0,0,0,1,0]
3. Tuesday [0,0,0,0,1,0,0]
4. Wednesday [0,0,0,1,0,0,0]
5. Thursday [0,0,1,0,0,0,0]
6. Friday [0,1,0,0,0,0,0]
7. Saturday [1,0,0,0,0,0,0]
For missing values: [0,0,0,0,0,0,0]
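scikit-learn's OneHotEncoder supports exactly this behavior via handle_unknown="ignore". A sketch (assuming scikit-learn ≥ 1.2, where the keyword is sparse_output; older releases call it sparse):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

days = ["Saturday", "Friday", "Thursday", "Wednesday",
        "Tuesday", "Monday", "Sunday"]  # column order of the table above

# Full one-hot: one column per category, no base dropped.
# handle_unknown="ignore" maps categories unseen at fit time
# to the all-zeros row shown above.
enc = OneHotEncoder(categories=[days], handle_unknown="ignore",
                    sparse_output=False)
enc.fit(np.array(days).reshape(-1, 1))

print(enc.transform([["Sunday"]]))   # [[0. 0. 0. 0. 0. 0. 1.]]
print(enc.transform([["Holiday"]]))  # [[0. 0. 0. 0. 0. 0. 0.]]
```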