All four functions seem really similar to me. In some situations some of them might give the same result, some not. Any help would be greatly appreciated!
From what I can tell, factorize
and LabelEncoder
work the same way internally and produce essentially the same results. I am not sure whether they take a similar amount of time on large amounts of data.
get_dummies
and OneHotEncoder
yield the same result, but OneHotEncoder
can only handle numbers while get_dummies
accepts all kinds of input. get_dummies
generates new column names automatically for each input column, whereas OneHotEncoder
does not (it assigns generic column names 1, 2, 3, ...). So get_dummies
seems better in all respects.
Please correct me if I am wrong! Thank you!
(1) get_dummies can't natively handle categories that were unseen during fitting; you have to apply workarounds, which is not efficient. OneHotEncoder, on the other hand, handles unknown categories natively.
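To illustrate the point above, here is a minimal sketch of how OneHotEncoder deals with an unseen category via its handle_unknown='ignore' option (the column name 'Col' and the data are made up for the example):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Col': ['A', 'B', 'C']})
test = pd.DataFrame({'Col': ['A', 'D']})  # 'D' was never seen during fit

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train[['Col']])
encoded = enc.transform(test[['Col']]).toarray()
print(encoded)
# the unknown 'D' becomes an all-zero row instead of raising an error
```

With get_dummies you would instead get a mismatched set of columns between train and test, which you have to reconcile by hand.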
Looking at your problem, get_dummies is the option to go with, as it gives equal weight to the categorical values. LabelEncoder is used when the categorical variable is ordinal, i.e. if you are converting severity or ranking, then encoding "High" as 2 and "low" as 1 makes sense.
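A small sketch of the ordinal case: note that LabelEncoder assigns codes in alphabetical order, so for a true severity ranking an explicit mapping keeps you in control of the order (the 'Severity' column and the values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Severity': ['low', 'High', 'low']})
# An explicit mapping preserves the intended order; LabelEncoder would
# instead assign codes alphabetically ('High' -> 0, 'low' -> 1)
df['Severity_enc'] = df['Severity'].map({'low': 1, 'High': 2})
print(df['Severity_enc'].tolist())  # [1, 2, 1]
```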
The factorize() method gives a numeric representation of an array by identifying its distinct values.
get_dummies() lets you easily one-hot encode your categorical data.
These four encoders can be split into two categories:

- factorize and scikit-learn LabelEncoder: the result has 1 dimension.
- get_dummies and scikit-learn OneHotEncoder: the result has n dimensions, one per distinct value of the encoded categorical variable.

The main difference between the pandas and scikit-learn encoders is that the scikit-learn encoders are made to be used in scikit-learn pipelines, through their fit
and transform
methods.
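As a minimal sketch of why the fit/transform interface matters, an encoder can be chained with a model in a single Pipeline (the column name, data, and choice of LogisticRegression are arbitrary for this example):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({'Col': ['A', 'B', 'B', 'C']})
y = [0, 1, 1, 0]

# The encoder's fit/transform interface lets it slot directly into a pipeline
pipe = Pipeline([
    ('encode', OneHotEncoder(handle_unknown='ignore')),
    ('model', LogisticRegression()),
])
pipe.fit(X, y)
print(pipe.predict(X))
```

pandas get_dummies, having no fit step, cannot be composed this way.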
Pandas factorize
and scikit-learn LabelEncoder
belong to the first category. They can be used to create categorical variables for example to transform characters into numbers.
import pandas as pd
from pandas import DataFrame
from sklearn import preprocessing
# Test data
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
df['Fact'] = pd.factorize(df['Col'])[0]
le = preprocessing.LabelEncoder()
df['Lab'] = le.fit_transform(df['Col'])
print(df)
# Col Fact Lab
# 0 A 0 0
# 1 B 1 1
# 2 B 1 1
# 3 C 2 2
Pandas get_dummies
and scikit-learn OneHotEncoder
belong to the second category. They can be used to create binary variables. OneHotEncoder
can only be used with categorical integers while get_dummies
can be used with other types of variables.
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
df = pd.get_dummies(df)
print(df)
# Col_A Col_B Col_C
# 0 1.0 0.0 0.0
# 1 0.0 1.0 0.0
# 2 0.0 1.0 0.0
# 3 0.0 0.0 1.0
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
# We need to transform the strings into integers first in order to use the OneHotEncoder
le = LabelEncoder()
df['Col'] = le.fit_transform(df['Col'])
enc = OneHotEncoder()
df = DataFrame(enc.fit_transform(df).toarray())
print(df)
# 0 1 2
# 0 1.0 0.0 0.0
# 1 0.0 1.0 0.0
# 2 0.0 1.0 0.0
# 3 0.0 0.0 1.0
I've also written a more detailed post based on this answer.