
Is pd.get_dummies one-hot encoding?

Given the difference between one-hot encoding and dummy coding, does pandas.get_dummies perform one-hot encoding when called with its default parameters (i.e. drop_first=False)?

If so, does it make sense to remove the intercept from the logistic regression model? Here is an example:

# Assume the dataset is already in a DataFrame X and the true labels are in y
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.80)

clf = LogisticRegression(fit_intercept=False)
clf.fit(X_train, y_train)
Mattia Paterna asked Jan 09 '18 14:01


2 Answers

Dummies are any variables that are either one or zero for each observation. pd.get_dummies, when applied to a column of categories with one category per observation, produces a new column (variable) for each unique categorical value and places a one in the column corresponding to the value present for that observation. This is equivalent to one-hot encoding.

One-hot encoding is characterized by having exactly one 1 per set of categorical values for each observation.

Consider the series s

s = pd.Series(list('AABBCCABCDDEE'))

s

0     A
1     A
2     B
3     B
4     C
5     C
6     A
7     B
8     C
9     D
10    D
11    E
12    E
dtype: object

pd.get_dummies will produce a one-hot encoding. And yes, it is appropriate not to fit the intercept: the dummy columns sum to one in every row, so together they already account for a constant term.

pd.get_dummies(s)

    A  B  C  D  E
0   1  0  0  0  0
1   1  0  0  0  0
2   0  1  0  0  0
3   0  1  0  0  0
4   0  0  1  0  0
5   0  0  1  0  0
6   1  0  0  0  0
7   0  1  0  0  0
8   0  0  1  0  0
9   0  0  0  1  0
10  0  0  0  1  0
11  0  0  0  0  1
12  0  0  0  0  1
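
The one-hot property of the table above can be checked directly: every row should sum to exactly one. The sketch below also illustrates why an intercept is redundant in that case, since a column of ones equals the sum of the dummy columns. It assumes NumPy is available and passes dtype=int so the frame holds 0/1 integers (newer pandas releases return booleans by default); the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(list('AABBCCABCDDEE'))
dummies = pd.get_dummies(s, dtype=int)

# One-hot means exactly one 1 in every row
row_sums = dummies.sum(axis=1)
is_one_hot = bool((row_sums == 1).all())

# Prepend an explicit intercept column of ones to the dummy columns
design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

# If the rank does not grow, the intercept column adds no new information
rank_without_intercept = np.linalg.matrix_rank(dummies.to_numpy())
rank_with_intercept = np.linalg.matrix_rank(design)

print(is_one_hot, rank_without_intercept, rank_with_intercept)
```

Here the design matrix with the intercept has six columns but the same rank as the five dummy columns alone, which is the collinearity that dropping the intercept avoids.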

However, if s held several labels per observation and you used pd.Series.str.get_dummies instead:

s = pd.Series('A|B,A,B,B,C|D,D|B,A,B,C,A|D'.split(','))

s

0    A|B
1      A
2      B
3      B
4    C|D
5    D|B
6      A
7      B
8      C
9    A|D
dtype: object

Then get_dummies produces dummy variables that are not one-hot encoded, and you could theoretically keep the intercept.

s.str.get_dummies()

   A  B  C  D
0  1  1  0  0
1  1  0  0  0
2  0  1  0  0
3  0  1  0  0
4  0  0  1  1
5  0  1  0  1
6  1  0  0  0
7  0  1  0  0
8  0  0  1  0
9  1  0  0  1
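
To see why this second table is not a one-hot encoding, the same row-sum check can be applied. This is a minimal sketch with illustrative variable names; it relies on '|' being the default separator of Series.str.get_dummies.

```python
import pandas as pd

s = pd.Series('A|B,A,B,B,C|D,D|B,A,B,C,A|D'.split(','))
multi = s.str.get_dummies()  # splits each string on '|' by default

# Count how many labels are switched on in each row
row_sums = multi.sum(axis=1)

# Rows with more than one label break the "exactly one 1" rule
rows_with_several_labels = row_sums[row_sums > 1].index.tolist()
is_one_hot = bool((row_sums == 1).all())

print(rows_with_several_labels, is_one_hot)
```

Rows 0, 4, 5 and 9 carry two labels each, so the columns no longer sum to a constant and an intercept is not implied by them.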
piRSquared answered Oct 02 '22 19:10


First question: yes, pd.get_dummies() performs one-hot encoding by default; see the example below, from the pd.get_dummies docs:

s = pd.Series(list('abca'))
pd.get_dummies(s, drop_first=False)
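
For comparison with dummy coding, here is a small sketch that runs the same series with drop_first=False and drop_first=True. The dtype=int argument is added here only so the result holds 0/1 integers regardless of pandas version; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(list('abca'))

# Default behaviour: one column per category (one-hot encoding)
one_hot = pd.get_dummies(s, drop_first=False, dtype=int)

# Dummy coding: the first category 'a' is dropped and becomes the baseline
dummy_coded = pd.get_dummies(s, drop_first=True, dtype=int)

print(list(one_hot.columns))      # ['a', 'b', 'c']
print(list(dummy_coded.columns))  # ['b', 'c']
```

With drop_first=True the rows for category 'a' are all zeros, so that coding relies on an intercept to represent the baseline category.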

Second question: [edited now that OP includes code example] yes, if you are one-hot encoding the inputs to a logistic regression model, it is appropriate to skip the intercept.
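
A rough way to convince yourself of this, sketched under some assumptions: the toy data below is made up, there is only a single categorical feature, and C is set very large so that scikit-learn's default L2 penalty is negligible. Without an intercept, each one-hot coefficient then acts as that category's own intercept, so the fitted probability per category should land close to the observed positive rate.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one categorical feature and a binary label
colour = pd.Series(['red'] * 4 + ['green'] * 4 + ['blue'] * 4)
y = np.array([1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0])

X = pd.get_dummies(colour, dtype=int)  # columns: blue, green, red

# Large C keeps the default L2 regularisation from shrinking the coefficients
clf = LogisticRegression(fit_intercept=False, C=1e6, max_iter=1000)
clf.fit(X, y)

# Predict once per category by feeding an identity matrix of dummies
categories = pd.DataFrame(np.eye(3, dtype=int), columns=X.columns)
fitted = dict(zip(X.columns, clf.predict_proba(categories)[:, 1]))

# Observed share of positive labels within each category
observed = pd.Series(y).groupby(colour.to_numpy()).mean().to_dict()

print(fitted)
print(observed)
```

With several one-hot encoded features at once, or with stronger regularisation, the picture is less clean, so treat this as an illustration rather than a general rule.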

muskrat answered Oct 02 '22 20:10
