
What is the difference between sklearn LabelEncoder and pd.get_dummies?

I wanted to know the difference between sklearn LabelEncoder and pandas get_dummies. Why would one choose LabelEncoder over get_dummies? What are the advantages and disadvantages of using one over the other?

As far as I understand, if I have a class A

ClassA = ["Apple", "Ball", "Cat"]
encoder = [1, 2, 3]

and

dummy = [001, 010, 100]

Am I understanding this incorrectly?

Sam asked Jul 16 '16 17:07



1 Answer

These are just convenience functions falling naturally into the way these two libraries tend to do things, respectively. The first one "condenses" the information by changing things to integers, and the second one "expands" the dimensions allowing (possibly) more convenient access.


sklearn.preprocessing.LabelEncoder simply transforms data, from whatever domain, so that its domain is 0, ..., k - 1, where k is the number of classes.

So, for example

["paris", "paris", "tokyo", "amsterdam"]

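could become

[0, 0, 1, 2]

(In practice LabelEncoder numbers the labels in sorted order, so the actual result for this input is [1, 1, 2, 0]; the idea is the same.)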
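
A minimal sketch of this step, assuming scikit-learn is installed; the variable names are only illustrative. It fits a LabelEncoder on the city list, prints the codes it assigns, and maps them back to the original labels.

```python
from sklearn.preprocessing import LabelEncoder

cities = ["paris", "paris", "tokyo", "amsterdam"]

# Fit the encoder and transform the labels in one call.
encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ holds the distinct labels; a label's position is its integer code.
classes = list(encoder.classes_)
codes_list = codes.tolist()

# inverse_transform recovers the original labels from the codes.
decoded = list(encoder.inverse_transform(codes))

print(classes)     # ['amsterdam', 'paris', 'tokyo'] (as NumPy strings)
print(codes_list)  # [1, 1, 2, 0]
```

The result is a single integer array of the same length as the input, with values in 0, ..., k - 1 for k = 3 classes.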

pandas.get_dummies also takes a Series with elements from some domain, but expands it into a DataFrame with one column per distinct value in the series. Each entry is 1 if the original element equals that column's value and 0 otherwise (recent pandas versions return True/False unless a dtype is passed). So, for example, the same

["paris", "paris", "tokyo", "amsterdam"]

would become a DataFrame with column labels

["amsterdam", "paris", "tokyo"]

and whose "paris" column would be the series

[1, 1, 0, 0]
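
The same example can be sketched with pandas, assuming it is installed. Passing dtype=int is a choice made here so the output is 0/1 regardless of the pandas version; the variable names are illustrative.

```python
import pandas as pd

cities = pd.Series(["paris", "paris", "tokyo", "amsterdam"])

# One indicator column per distinct value; dtype=int forces 0/1 output.
dummies = pd.get_dummies(cities, dtype=int)

columns = list(dummies.columns)
paris_column = dummies["paris"].tolist()
row_sums = dummies.sum(axis=1).tolist()

print(columns)       # ['amsterdam', 'paris', 'tokyo']
print(paris_column)  # [1, 1, 0, 0]
print(row_sums)      # [1, 1, 1, 1]
```

Each row contains exactly one 1, in the column for that row's original value.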

The main advantage of the first method is that it conserves space: one integer column instead of one column per category. On the other hand, encoding things as integers might give the impression (to you or to some machine learning algorithm) that the order means something. Is "amsterdam" closer to "tokyo" than to "paris" just because of the integer encoding? Probably not. The second representation is clearer on that, since every category gets its own column and no category is numerically closer to another.
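
To make the ordering concern concrete, here is a small sketch using NumPy. It assumes the illustrative integer codes from the example above (paris=0, tokyo=1, amsterdam=2) and hypothetical helper functions; it compares distances between cities under each representation.

```python
import numpy as np

cities = ["paris", "tokyo", "amsterdam"]

# Illustrative integer codes: paris=0, tokyo=1, amsterdam=2.
int_codes = {city: i for i, city in enumerate(cities)}

# One indicator vector per city (rows of the identity matrix).
identity = np.eye(len(cities), dtype=int)
one_hot = {city: identity[i] for i, city in enumerate(cities)}


def int_distance(a, b):
    """Absolute difference between the integer codes of two cities."""
    return abs(int_codes[a] - int_codes[b])


def one_hot_distance(a, b):
    """Euclidean distance between the indicator vectors of two cities."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))


# Integer codes suggest amsterdam is nearer to tokyo than to paris.
print(int_distance("amsterdam", "tokyo"))  # 1
print(int_distance("amsterdam", "paris"))  # 2

# Indicator vectors place every pair of distinct cities equally far apart.
print(one_hot_distance("amsterdam", "tokyo"))
print(one_hot_distance("amsterdam", "paris"))
```

A model that treats its inputs as numeric magnitudes would see the first pair of distances; with indicator columns it sees the second.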

Ami Tavory answered Oct 26 '22 04:10