All four functions seem really similar to me. In some situations some of them might give the same result, some not. Any help would be greatly appreciated!
From what I can tell, factorize
and LabelEncoder
work the same way internally and produce essentially the same results. I am not sure whether they take a similar amount of time on large amounts of data.
get_dummies
and OneHotEncoder
yield the same result, but OneHotEncoder
can only handle numbers while get_dummies
accepts all kinds of input. get_dummies
generates new column names automatically for each input column, whereas OneHotEncoder
does not (it assigns generic column names 1, 2, 3, ...). So get_dummies
seems better in all respects.
Please correct me if I am wrong! Thank you!
(1) get_dummies can't natively handle categories that were unseen during fitting; you have to apply workarounds, which is not efficient. OneHotEncoder, on the other hand, handles unknown categories natively.
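To illustrate the point above, here is a minimal sketch of how OneHotEncoder deals with an unseen category via its handle_unknown='ignore' option (the column name 'Col' and the data are made up for the example):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Col': ['A', 'B', 'C']})
test = pd.DataFrame({'Col': ['A', 'D']})  # 'D' was never seen during fit

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train[['Col']])
encoded = enc.transform(test[['Col']]).toarray()
print(encoded)
# the unknown 'D' becomes an all-zero row instead of raising an error
```

With get_dummies you would instead get a mismatched set of columns between train and test, which you have to reconcile by hand.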
Looking at your problem, get_dummies is the option to go with, as it gives equal weight to the categorical values. LabelEncoder is used when the categorical variable is ordinal, i.e. if you are converting severity or ranking, then encoding "High" as 2 and "low" as 1 makes sense.
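A small sketch of the ordinal case: note that LabelEncoder assigns codes in alphabetical order, so for a true severity ranking an explicit mapping keeps you in control of the order (the 'Severity' column and the values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Severity': ['low', 'High', 'low']})
# An explicit mapping preserves the intended order; LabelEncoder would
# instead assign codes alphabetically ('High' -> 0, 'low' -> 1)
df['Severity_enc'] = df['Severity'].map({'low': 1, 'High': 2})
print(df['Severity_enc'].tolist())  # [1, 2, 1]
```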
The factorize() method gives a numeric representation of an array by identifying its distinct values.
get_dummies() lets you easily one-hot encode your categorical data.
These four encoders can be split into two categories:

- factorize and scikit-learn LabelEncoder: the result has 1 dimension.
- get_dummies and scikit-learn OneHotEncoder: the result has n dimensions, one per distinct value of the encoded categorical variable.

The main difference between the pandas and scikit-learn encoders is that the scikit-learn encoders are made to be used in scikit-learn pipelines, through their fit
and transform
methods.
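As a minimal sketch of why the fit/transform interface matters, an encoder can be chained with a model in a single Pipeline (the column name, data, and choice of LogisticRegression are arbitrary for this example):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({'Col': ['A', 'B', 'B', 'C']})
y = [0, 1, 1, 0]

# The encoder's fit/transform interface lets it slot directly into a pipeline
pipe = Pipeline([
    ('encode', OneHotEncoder(handle_unknown='ignore')),
    ('model', LogisticRegression()),
])
pipe.fit(X, y)
print(pipe.predict(X))
```

pandas get_dummies, having no fit step, cannot be composed this way.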
Pandas factorize
and scikit-learn LabelEncoder
belong to the first category. They can be used to create categorical variables for example to transform characters into numbers.
import pandas as pd
from pandas import DataFrame
from sklearn import preprocessing
# Test data
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
df['Fact'] = pd.factorize(df['Col'])[0]
le = preprocessing.LabelEncoder()
df['Lab'] = le.fit_transform(df['Col'])
print(df)
# Col Fact Lab
# 0 A 0 0
# 1 B 1 1
# 2 B 1 1
# 3 C 2 2
Pandas get_dummies
and scikit-learn OneHotEncoder
belong to the second category. They can be used to create binary variables. OneHotEncoder
can only be used with categorical integers while get_dummies
can be used with other types of variables.
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
df = pd.get_dummies(df)
print(df)
# Col_A Col_B Col_C
# 0 1.0 0.0 0.0
# 1 0.0 1.0 0.0
# 2 0.0 1.0 0.0
# 3 0.0 0.0 1.0
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
# We need to transform the strings into integers first in order to use the OneHotEncoder
le = LabelEncoder()
df['Col'] = le.fit_transform(df['Col'])
enc = OneHotEncoder()
df = DataFrame(enc.fit_transform(df).toarray())
print(df)
# 0 1 2
# 0 1.0 0.0 0.0
# 1 0.0 1.0 0.0
# 2 0.0 1.0 0.0
# 3 0.0 0.0 1.0
I've also written a more detailed post based on this answer.