I want to know the difference between sklearn's LabelEncoder and pandas' get_dummies. Why would one choose LabelEncoder over get_dummies? What are the advantages of using one over the other? The disadvantages?
As far as I understand, if I have a categorical feature
ClassA = ["Apple", "Ball", "Cat"]
then the label encoding would be something like
encoder = [0, 1, 2]
and the dummy encoding would be
dummy = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
Am I understanding this incorrectly?
LabelEncoder can be used to normalize labels. It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels. Its fit method learns the label-to-integer mapping, and fit_transform fits the encoder and returns the encoded labels in one step.
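A minimal sketch of those methods (the labels are made up for illustration):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# fit learns the label-to-integer mapping; classes_ holds it, sorted
le.fit(["Apple", "Ball", "Cat", "Apple"])
print(le.classes_)  # ['Apple' 'Ball' 'Cat']

# fit_transform fits and returns the encoded labels in one step
codes = le.fit_transform(["Apple", "Ball", "Cat", "Apple"])
print(codes)  # [0 1 2 0]

# inverse_transform recovers the original labels from the codes
print(le.inverse_transform(codes))  # ['Apple' 'Ball' 'Cat' 'Apple']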
With one-hot encoding you get one new column of 1s and 0s per category: a country feature with three countries, for example, becomes three new columns. So, that's the difference between Label Encoding and One Hot Encoding.
get_dummies() is used for data manipulation: it converts categorical data into dummy (indicator) variables.
Syntax:
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
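A quick example with the labels from the question (dtype=int just forces 0/1 integers, since recent pandas versions return booleans by default):

import pandas as pd

df = pd.DataFrame({"item": ["Apple", "Ball", "Cat"]})

# One indicator column per distinct value of "item"
print(pd.get_dummies(df, dtype=int))
#    item_Apple  item_Ball  item_Cat
# 0           1          0         0
# 1           0          1         0
# 2           0          0         1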
One Hot Encoder: get_dummies() with added advantages. It encodes a nominal (categorical) feature by creating one binary column per category of each encoded feature. Scikit-learn provides the implementation as sklearn.preprocessing.OneHotEncoder.
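A sketch of those advantages: unlike get_dummies(), the fitted encoder remembers its categories, so it can apply the same mapping to data it has never seen and can sit inside a Pipeline (the city values are made up for illustration):

from sklearn.preprocessing import OneHotEncoder

# sparse_output needs scikit-learn >= 1.2; older versions use sparse=False
enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
enc.fit([["paris"], ["tokyo"], ["amsterdam"]])
print(enc.categories_)  # [array(['amsterdam', 'paris', 'tokyo'], dtype=object)]

# The learned mapping is reused at transform time; an unseen category
# becomes an all-zero row because of handle_unknown="ignore".
print(enc.transform([["tokyo"], ["berlin"]]))
# [[0. 0. 1.]
#  [0. 0. 0.]]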
These are just convenience functions that fall naturally out of the way these two libraries tend to do things. The first one "condenses" the information by changing things to integers, and the second one "expands" the dimensions, allowing (possibly) more convenient access.
sklearn.preprocessing.LabelEncoder
simply transforms data, from whatever domain, so that its domain is 0, ..., k - 1, where k is the number of classes.
So, for example
["paris", "paris", "tokyo", "amsterdam"]
becomes
[1, 1, 2, 0]
(LabelEncoder sorts its classes, so "amsterdam" → 0, "paris" → 1, "tokyo" → 2).
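Reproducing that with scikit-learn:

from sklearn.preprocessing import LabelEncoder

print(LabelEncoder().fit_transform(["paris", "paris", "tokyo", "amsterdam"]))
# [1 1 2 0] -- classes_ is sorted: ['amsterdam' 'paris' 'tokyo']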
pandas.get_dummies
also takes a Series with elements from some domain, but expands it into a DataFrame whose columns correspond to the distinct values in the Series, and whose values are 0 or 1 depending on what the entries originally were. So, for example, the same
["paris", "paris", "tokyo", "amsterdam"]
would become a DataFrame with columns
["amsterdam", "paris", "tokyo"]
(note that get_dummies also sorts the column labels),
and whose "paris"
entry would be the series
[1, 1, 0, 0]
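The same example end to end (dtype=int just keeps the output as 0/1 integers rather than booleans):

import pandas as pd

s = pd.Series(["paris", "paris", "tokyo", "amsterdam"])
print(pd.get_dummies(s, dtype=int))
#    amsterdam  paris  tokyo
# 0          0      1      0
# 1          0      1      0
# 2          0      0      1
# 3          1      0      0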
The main advantage of the first method is that it conserves space. Conversely, encoding things as integers might give the impression (to you, or to some machine learning algorithm) that the order means something. Is "amsterdam" closer to "tokyo" than to "paris" just because of the integer encoding? Probably not. The second representation is a bit clearer on that point.