Converting categorical variables to numbers based on frequency in a single line

Tags: python, pandas, numpy, scikit-learn

This is similar to LabelEncoder from scikit-learn, but with the requirement that the numbers are assigned in order of category frequency, i.e., the most frequent category gets the highest or the lowest number, depending on the use case.

For example, suppose the variable can take the values [a, b, c] with the following frequencies:

  Category
0        a
0        a
0        a
0        a
0        a
1        b
1        b
1        b
1        b
1        b
1        b
1        b
1        b
1        b
1        b
2        c
2        c

Here a occurs 5 times, b occurs 10 times and c occurs 2 times. I want the replacements to be b=1, a=2 and c=3.

asked Sep 16 '18 17:09

goelakash

2 Answers

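See rank (argsort returns the positions that would sort the column, not each row's rank, so it does not produce the ordering asked for):

df['Order'] = df['Frequency'].rank(ascending=False, method='first').astype(int)
df

returns

  Category  Frequency  Order
0        a          5      2
1        b         10      1
2        c          2      3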
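The question allows for either direction (most frequent gets the highest or the lowest number). A minimal sketch of both, assuming the same three-row frequency table and using Series.rank; the column names Order_desc and Order_asc are illustrative and not from the original answer:

```python
import pandas as pd

# Same three-row frequency table as in the answer above
df = pd.DataFrame({"Category": ["a", "b", "c"], "Frequency": [5, 10, 2]})

# Most frequent category gets 1 (b=1, a=2, c=3);
# method="first" breaks ties by row order so the result stays an integer rank
df["Order_desc"] = df["Frequency"].rank(ascending=False, method="first").astype(int)

# Least frequent category gets 1 (c=1, a=2, b=3)
df["Order_asc"] = df["Frequency"].rank(ascending=True, method="first").astype(int)

print(df)
```

Order_desc matches the b=1, a=2, c=3 assignment requested in the question.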

answered Oct 15 '22 04:10

Alex

If you are using pandas, you can use its map() method:

import pandas as pd
data = pd.DataFrame([['a'], ['b'], ['c']], columns=['category'])

print(data)

  category
0        a
1        b
2        c

mapping_dict = {'b':1, 'a':2, 'c':3}

print(data['category'].map(mapping_dict))

0    2
1    1
2    3
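The mapping above is written by hand. A minimal sketch of deriving it from the data instead, assuming the raw column holds the 17 values from the question; the column name encoded is illustrative:

```python
import pandas as pd

# Raw column from the question: a x5, b x10, c x2
data = pd.DataFrame({"category": ["a"] * 5 + ["b"] * 10 + ["c"] * 2})

# value_counts() sorts by count in descending order by default,
# so the most frequent category comes first in the index
counts = data["category"].value_counts()

# Number the categories 1, 2, 3, ... in that order
mapping_dict = {cat: i for i, cat in enumerate(counts.index, start=1)}

data["encoded"] = data["category"].map(mapping_dict)
print(mapping_dict)
```

Categories with equal counts are numbered in whatever order value_counts() lists them, so an explicit tie-break may be needed if that matters for the use case.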

LabelEncoder uses np.unique to find the unique values present in a column. np.unique returns them in sorted order (alphabetical for strings), so LabelEncoder cannot apply a custom ordering such as one based on frequency.
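The sorted-unique behaviour can be seen with NumPy alone. This is a small illustration of np.unique, not LabelEncoder's actual source, and the sample values are made up:

```python
import numpy as np

# b is the most frequent value, but it does not come first in the result
values = np.array(["c", "b", "a", "b", "b"])

# classes: sorted unique values; codes: index of each value within classes
classes, codes = np.unique(values, return_inverse=True)

print(classes.tolist())  # ['a', 'b', 'c']
print(codes.tolist())    # [2, 1, 0, 1, 1]
```

The codes follow alphabetical position rather than frequency, which is why a separate frequency-based mapping is needed.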

answered Oct 15 '22 04:10

Vivek Kumar
