Given the difference between one-hot encoding and dummy coding, does the pandas.get_dummies method perform one-hot encoding when using its default parameters (i.e. drop_first=False)?
If so, does it make sense that I remove the intercept from the logistic regression model? Here is an example:
# Assume the dataset is already in a DataFrame X and the true labels are in y
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.80)
clf = LogisticRegression(fit_intercept=False)
clf.fit(X_train, y_train)
Both OneHotEncoder and get_dummies give the same results, but there are some important differences between them. In particular, get_dummies cannot natively handle categories that first appear at transformation time (unknown categories); you have to handle them yourself.
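A minimal sketch of that difference (the column name and category values here are invented for illustration):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
train = pd.DataFrame({'color': ['red', 'green', 'blue']})
test = pd.DataFrame({'color': ['green', 'purple']})  # 'purple' never seen in training
# get_dummies encodes each frame independently, so the column sets disagree:
print(pd.get_dummies(train['color']).columns.tolist())  # ['blue', 'green', 'red']
print(pd.get_dummies(test['color']).columns.tolist())   # ['green', 'purple']
# OneHotEncoder learns the categories once and can ignore unseen ones:
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train[['color']])
print(enc.transform(test[['color']]).toarray())  # 'purple' becomes an all-zeros row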
This is why we need encoding methods that convert non-numerical data into meaningful numerical data, and pandas' get_dummies is one of the easiest ways to implement one-hot encoding. The get_dummies() function converts a categorical variable into dummy/indicator variables; its most useful parameters are data (the data of which to get dummy indicators), prefix (a string to append to the DataFrame column names) and prefix_sep (the separator/delimiter to use when appending a prefix).
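For instance, a quick sketch of the naming parameters:
import pandas as pd
s = pd.Series(['red', 'blue', 'red'])
# prefix and prefix_sep control the generated column names:
print(pd.get_dummies(s, prefix='color', prefix_sep='_').columns.tolist())
# ['color_blue', 'color_red']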
One-hot encoding is a type of vector representation in which all of the elements of the vector are 0 except one, which has the value 1; that single 1 acts as a boolean indicating the category of the element.
Dummies are any variables that are either one or zero for each observation. pd.get_dummies, when applied to a column of categories with one category per observation, will produce a new column (variable) for each unique categorical value and place a one in the column corresponding to the value present for that observation. This is equivalent to one-hot encoding, which is characterized by having exactly one 1 per set of categorical columns per observation.
Consider the series s
s = pd.Series(list('AABBCCABCDDEE'))
s
0 A
1 A
2 B
3 B
4 C
5 C
6 A
7 B
8 C
9 D
10 D
11 E
12 E
dtype: object
pd.get_dummies will produce one-hot encoding. And yes! It is absolutely appropriate not to fit the intercept.
pd.get_dummies(s)
A B C D E
0 1 0 0 0 0
1 1 0 0 0 0
2 0 1 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 1 0 0
6 1 0 0 0 0
7 0 1 0 0 0
8 0 0 1 0 0
9 0 0 0 1 0
10 0 0 0 1 0
11 0 0 0 0 1
12 0 0 0 0 1
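One way to see why the intercept is redundant here: every row contains exactly one 1, so the dummy columns sum to the constant vector of ones, which is exactly what an intercept column would contribute. A quick check:
import pandas as pd
s = pd.Series(list('AABBCCABCDDEE'))
d = pd.get_dummies(s)
# Every row sums to exactly 1, so the dummy columns add up to a constant
# column of ones, making them collinear with an intercept.
print(d.sum(axis=1).unique())  # [1]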
However, if s contained different data and you used pd.Series.str.get_dummies:
s = pd.Series('A|B,A,B,B,C|D,D|B,A,B,C,A|D'.split(','))
s
0 A|B
1 A
2 B
3 B
4 C|D
5 D|B
6 A
7 B
8 C
9 A|D
dtype: object
Then get_dummies produces dummy variables that are not one-hot encoded, and you could theoretically leave the intercept in.
s.str.get_dummies()
A B C D
0 1 1 0 0
1 1 0 0 0
2 0 1 0 0
3 0 1 0 0
4 0 0 1 1
5 0 1 0 1
6 1 0 0 0
7 0 1 0 0
8 0 0 1 0
9 1 0 0 1
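Here the rows no longer sum to a constant (some rows contain two 1s), so the dummy columns are not collinear with a column of ones; the same quick check shows this:
import pandas as pd
s = pd.Series('A|B,A,B,B,C|D,D|B,A,B,C,A|D'.split(','))
# Rows sum to 1 or 2 rather than a constant, so no intercept column is
# hiding inside the dummies.
print(s.str.get_dummies().sum(axis=1).unique())  # [2 1]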
First question: yes, pd.get_dummies() is one-hot encoding in its default state; see the example below, from the pd.get_dummies docs:
s = pd.Series(list('abca'))
pd.get_dummies(s, drop_first=False)
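For reference, this produces (recent pandas versions show True/False booleans instead of 0/1):
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0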
Second question: [edited now that OP includes code example] yes, if you are one-hot encoding the inputs to a logistic regression model, it is appropriate to skip the intercept.
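For completeness, the other standard way to avoid the redundancy is to keep the intercept but drop one dummy level per variable; a minimal sketch, reusing the X and y from the question:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Alternative: drop one level per categorical variable, keep the intercept.
X = pd.get_dummies(X, drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.80)
clf = LogisticRegression(fit_intercept=True)
clf.fit(X_train, y_train)
Keep in mind that scikit-learn's LogisticRegression regularizes the coefficients but not the intercept by default, so the two parameterizations will generally not produce numerically identical fits.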