Encoding String to numbers so as to use it in scikit-learn

Tags: encoding, machine-learning, scikit-learn, random-forest

My data consists of 50 columns, most of them strings. I have a single multi-class target variable to predict. I tried using LabelEncoder in scikit-learn to convert the features (not the classes) into integers and feed them to the RandomForest classifier I am using.

Now, when new test data arrives (as a stream of new data), how will I know, for each column, which integer each string should map to? Running LabelEncoder on the new data will give me labels independent of the ones I generated before. Am I doing this wrong? Is there something else I should use for consistent encoding?

asked Jun 16 '15 13:06

Huga

2 Answers

The LabelEncoder class has two methods that handle this distinction: fit and transform. Typically you call fit first to map some data to a set of integers:

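>>> from sklearn.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>> le.fit(['a', 'e', 'b', 'z'])  # learns the sorted set of unique labels
LabelEncoder()
>>> le.classes_
array(['a', 'b', 'e', 'z'], dtype='<U1')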

Once you've fit your encoder, you can transform any data to the label space, without changing the existing mapping:

>>> le.transform(['a', 'e', 'a', 'z', 'a', 'b'])
array([0, 2, 0, 3, 0, 1])
>>> le.transform(['e', 'e', 'e'])
array([2, 2, 2])

This encoder assumes that you know beforehand every label that appears in your data; calling transform on a string that was not seen during fit raises a ValueError. If labels might show up later (e.g., in an online learning scenario), you'll need to decide how to handle those outside the encoder.
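One possible way to handle unseen labels outside the encoder, sketched under assumptions: reserve a placeholder category when fitting, then replace any unseen string with that placeholder before calling transform. The helper name `replace_unseen` and the `"<unknown>"` token are hypothetical, not part of scikit-learn, and `known_classes` stands in for `list(le.classes_)` from an encoder that was fit with the placeholder included.

```python
def replace_unseen(values, known_classes, placeholder="<unknown>"):
    """Return a copy of `values` with strings not in `known_classes` replaced.

    `placeholder` must itself be one of the known classes (i.e. it was part
    of the data the encoder was fit on), otherwise transform would reject it.
    """
    known = set(known_classes)
    if placeholder not in known:
        raise ValueError("placeholder must be one of the known classes")
    return [v if v in known else placeholder for v in values]


# Stand-in (assumption) for list(le.classes_) after fitting on the
# training labels plus the placeholder token.
known_classes = ["<unknown>", "a", "b", "e", "z"]

cleaned = replace_unseen(["a", "q", "z"], known_classes)
print(cleaned)  # ['a', '<unknown>', 'z']
```

The cleaned list can then be passed to the fitted encoder's transform. The trade-off is that every unseen string collapses into a single category, which may or may not suit your data.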

answered Sep 24 '22 02:09

lmjohns3

You could save the string -> label mapping learned from the training data for each column.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> col_1 = ["paris", "paris", "tokyo", "amsterdam"]
>>> le.fit(col_1)
LabelEncoder()
>>> # classes_ is sorted, and each string's code is its index in classes_
>>> {label: code for code, label in enumerate(le.classes_.tolist())}
{'amsterdam': 0, 'paris': 1, 'tokyo': 2}

When the test data arrives, you can use those saved mappings to encode the corresponding columns. You do not have to fit the encoder again on the test data.
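As a rough sketch of the save-and-reuse idea in plain Python (no scikit-learn required): build one mapping per column from the training rows, store it as JSON, and look strings up when test rows arrive. The function names, the example columns, and the choice of -1 for unseen strings are all assumptions made for illustration. Sorting the unique values is intended to give the same codes LabelEncoder would assign.

```python
import json


def build_mappings(rows, columns):
    """Build one string -> integer mapping per column from training rows."""
    mappings = {}
    for col in columns:
        # Sort the unique values so each code is the value's sorted position.
        unique_values = sorted({row[col] for row in rows})
        mappings[col] = {value: code for code, value in enumerate(unique_values)}
    return mappings


def encode_rows(rows, mappings, unknown_code=-1):
    """Encode rows with saved mappings; unseen strings get `unknown_code`."""
    return [
        [mappings[col].get(row[col], unknown_code) for col in mappings]
        for row in rows
    ]


# Hypothetical training data with two string columns.
train_rows = [
    {"city": "paris", "color": "red"},
    {"city": "tokyo", "color": "blue"},
    {"city": "amsterdam", "color": "red"},
]
mappings = build_mappings(train_rows, ["city", "color"])

# Round-trip through JSON to mimic saving the mappings and loading them later.
restored = json.loads(json.dumps(mappings))

# "green" never appeared in training, so it falls back to unknown_code.
test_rows = [{"city": "tokyo", "color": "green"}]
encoded = encode_rows(test_rows, restored)
print(encoded)  # [[2, -1]]
```

In practice the JSON string would be written to a file next to the trained model, so that the same mappings are loaded whenever new data is scored.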

answered Sep 27 '22 02:09

Chung-Yen Hung
