
Scikit-learn's LabelBinarizer vs. OneHotEncoder

What is the difference between the two? Both seem to create new columns, one per unique category in the feature, and then assign 0 or 1 to each data point depending on which category it belongs to.

Roozbeh Bakhshi asked May 22 '18 17:05



2 Answers

A simple example which encodes an array using LabelEncoder, OneHotEncoder, LabelBinarizer is shown below.

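I see that in older scikit-learn releases (before 0.20) OneHotEncoder needs the data in integer-encoded form first, for example via LabelEncoder, before it can produce the one-hot encoding. LabelBinarizer does not require that step. From 0.20 onward, OneHotEncoder also accepts string categories directly.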

    from numpy import array
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import LabelBinarizer

    # define example
    data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
    values = array(data)
    print("Data:", values)

    # integer encode
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(values)
    print("Label Encoder:", integer_encoded)

    # one-hot encode; OneHotEncoder expects 2-D input, so reshape to a single column
    onehot_encoder = OneHotEncoder()
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    # .toarray() converts the default sparse result to a dense array
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded).toarray()
    print("OneHot Encoder:", onehot_encoded)

    # binary encode
    lb = LabelBinarizer()
    print("Label Binarizer:", lb.fit_transform(values))
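For readers on scikit-learn 0.20 or later, here is a minimal sketch of the same comparison without the LabelEncoder step. It assumes OneHotEncoder is handed the strings directly as a single column; the shortened data and the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

values = np.array(['cold', 'cold', 'warm', 'cold', 'hot'])

# OneHotEncoder expects 2-D input (samples x features), so reshape to one column.
onehot = OneHotEncoder().fit_transform(values.reshape(-1, 1)).toarray()

# LabelBinarizer takes the 1-D array as-is.
binarized = LabelBinarizer().fit_transform(values)

print(onehot)
print(binarized)
```

For a single column with three or more categories, both encoders should produce the same 0/1 pattern; the OneHotEncoder result is floating point, while LabelBinarizer returns integers.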


Another good link which explains the OneHotEncoder is: Explain onehotencoder using python

There may be other valid differences between the two which experts can probably explain.
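One such difference, as far as I can tell, shows up with exactly two classes: LabelBinarizer returns a single 0/1 column, while OneHotEncoder still returns one column per category. The sketch below uses made-up 'yes'/'no' labels to illustrate this, and also shows inverse_transform recovering the original labels.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

labels = np.array(['yes', 'no', 'yes', 'no'])  # illustrative two-class data

lb = LabelBinarizer()
lb_out = lb.fit_transform(labels)  # one column for a two-class problem
ohe_out = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()  # one column per category

print(lb_out.shape)   # (4, 1)
print(ohe_out.shape)  # (4, 2)

# Map the binarized column back to the original labels.
recovered = lb.inverse_transform(lb_out)
print(recovered)
```

This fits the way the scikit-learn documentation positions the two classes: LabelBinarizer is aimed at target labels (y), OneHotEncoder at input features (X).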

Rahul Pant answered Sep 22 '22 21:09


One difference is that OneHotEncoder works on multi-column data, while LabelBinarizer and LabelEncoder do not.

    from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

    X = [["US", "M"], ["UK", "M"], ["FR", "F"]]
    OneHotEncoder().fit_transform(X).toarray()
    # array([[0., 0., 1., 0., 1.],
    #        [0., 1., 0., 0., 1.],
    #        [1., 0., 0., 1., 0.]])

    LabelBinarizer().fit_transform(X)
    # ValueError: Multioutput target data is not supported with label binarization

    LabelEncoder().fit_transform(X)
    # ValueError: bad input shape (3, 2)  (exact wording varies by scikit-learn version)
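To see which output column belongs to which input column, one option is to inspect the fitted encoder's categories_ attribute. A small sketch on the same X, assuming scikit-learn 0.20 or later (the variable names are mine):

```python
from sklearn.preprocessing import OneHotEncoder

X = [["US", "M"], ["UK", "M"], ["FR", "F"]]

encoder = OneHotEncoder()
encoded = encoder.fit_transform(X).toarray()

# categories_ holds one sorted array of categories per input column;
# the output columns follow that order, input column by input column.
categories = [c.tolist() for c in encoder.categories_]
print(categories)  # [['FR', 'UK', 'US'], ['F', 'M']]

# inverse_transform maps the one-hot rows back to the original values.
restored = encoder.inverse_transform(encoded).tolist()
print(restored)
```

So the first three output columns encode the country (FR, UK, US) and the last two encode the second column (F, M).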
Kota Mori answered Sep 24 '22 21:09