I want to know the difference between sklearn's LabelEncoder and pandas' get_dummies. Why would one choose LabelEncoder over get_dummies? What are the advantages of using one over the other? The disadvantages?
As far as I understand, if I have a categorical feature
ClassA = ["Apple", "Ball", "Cat"]
then the label encoding would be something like
encoder = [0, 1, 2]
and the dummy encoding would be
dummy = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
Am I understanding this incorrectly?
LabelEncoder can be used to normalize labels. It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels. Its fit method learns the label-to-integer mapping, and fit_transform fits the encoder and returns the encoded labels in one step.
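A minimal sketch of those methods (the labels are made up for illustration):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# fit learns the label-to-integer mapping; classes_ holds it, sorted
le.fit(["Apple", "Ball", "Cat", "Apple"])
print(le.classes_)  # ['Apple' 'Ball' 'Cat']

# fit_transform fits and returns the encoded labels in one step
codes = le.fit_transform(["Apple", "Ball", "Cat", "Apple"])
print(codes)  # [0 1 2 0]

# inverse_transform recovers the original labels from the codes
print(le.inverse_transform(codes))  # ['Apple' 'Ball' 'Cat' 'Apple']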
With one-hot encoding you get one new column of 1s and 0s per category: a country feature with three countries, for example, becomes three new columns. So, that's the difference between Label Encoding and One Hot Encoding.
get_dummies() is used for data manipulation: it converts categorical data into dummy (indicator) variables.
Syntax:
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
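A quick example with the labels from the question (dtype=int just forces 0/1 integers, since recent pandas versions return booleans by default):

import pandas as pd

df = pd.DataFrame({"item": ["Apple", "Ball", "Cat"]})

# One indicator column per distinct value of "item"
print(pd.get_dummies(df, dtype=int))
#    item_Apple  item_Ball  item_Cat
# 0           1          0         0
# 1           0          1         0
# 2           0          0         1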
One Hot Encoder: get_dummies() with added advantages. It encodes a nominal (categorical) feature by creating one binary column per category of each encoded feature. Scikit-learn provides the implementation as sklearn.preprocessing.OneHotEncoder.
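A sketch of those advantages: unlike get_dummies(), the fitted encoder remembers its categories, so it can apply the same mapping to data it has never seen and can sit inside a Pipeline (the city values are made up for illustration):

from sklearn.preprocessing import OneHotEncoder

# sparse_output needs scikit-learn >= 1.2; older versions use sparse=False
enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
enc.fit([["paris"], ["tokyo"], ["amsterdam"]])
print(enc.categories_)  # [array(['amsterdam', 'paris', 'tokyo'], dtype=object)]

# The learned mapping is reused at transform time; an unseen category
# becomes an all-zero row because of handle_unknown="ignore".
print(enc.transform([["tokyo"], ["berlin"]]))
# [[0. 0. 1.]
#  [0. 0. 0.]]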
These are just convenience functions that fall naturally out of the way these two libraries tend to do things. The first one "condenses" the information by changing things to integers, and the second one "expands" the dimensions, allowing (possibly) more convenient access.
sklearn.preprocessing.LabelEncoder
simply transforms data, from whatever domain, so that its domain is 0, ..., k - 1, where k is the number of classes.
So, for example
["paris", "paris", "tokyo", "amsterdam"]
becomes
[1, 1, 2, 0]
(LabelEncoder sorts its classes, so "amsterdam" → 0, "paris" → 1, "tokyo" → 2).
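Reproducing that with scikit-learn:

from sklearn.preprocessing import LabelEncoder

print(LabelEncoder().fit_transform(["paris", "paris", "tokyo", "amsterdam"]))
# [1 1 2 0] -- classes_ is sorted: ['amsterdam' 'paris' 'tokyo']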
pandas.get_dummies
also takes a Series with elements from some domain, but expands it into a DataFrame whose columns correspond to the distinct values in the Series, and whose values are 0 or 1 depending on what the entries originally were. So, for example, the same
["paris", "paris", "tokyo", "amsterdam"]
would become a DataFrame with columns
["amsterdam", "paris", "tokyo"]
(note that get_dummies also sorts the column labels),
and whose "paris"
entry would be the series
[1, 1, 0, 0]
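The same example end to end (dtype=int just keeps the output as 0/1 integers rather than booleans):

import pandas as pd

s = pd.Series(["paris", "paris", "tokyo", "amsterdam"])
print(pd.get_dummies(s, dtype=int))
#    amsterdam  paris  tokyo
# 0          0      1      0
# 1          0      1      0
# 2          0      0      1
# 3          1      0      0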
The main advantage of the first method is that it conserves space. Conversely, encoding things as integers might give the impression (to you, or to some machine learning algorithm) that the order means something. Is "amsterdam" closer to "tokyo" than to "paris" just because of the integer encoding? Probably not. The second representation is a bit clearer on that point.