This is similar to LabelEncoder from scikit-learn, but with the requirement that the numeric values be assigned in order of category frequency, i.e., the most frequent category gets the highest or lowest number (depending on the use case).
E.g., if the variable can take the values [a, b, c] with frequencies such as
Category
0 a
0 a
0 a
0 a
0 a
1 b
1 b
1 b
1 b
1 b
1 b
1 b
1 b
1 b
1 b
2 c
2 c
a occurs 5 times, b occurs 10 times and c occurs 2 times.
Then I want the replacements to be done as b=1, a=2 and c=3.
We will be using LabelEncoder() from the sklearn library to convert categorical data to numerical data, calling its fit_transform() method in the process.
Method 1: Using the replace() method
Replacing is one way to convert categorical terms into numeric ones. For example, we will take a dataset of people's salaries based on their level of education. This is an ordinal categorical variable. We will convert the education levels into numeric codes.
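A minimal sketch of the replace() approach described above. The column names, education levels, salary figures and the ordering in education_order are all made up for illustration; they do not come from a real dataset.

```python
import pandas as pd

# Hypothetical data: levels and salaries are invented for this sketch.
# dtype=object keeps the example independent of pandas' default string dtype.
df = pd.DataFrame({
    "education": pd.Series(
        ["High School", "Bachelor", "Master", "Bachelor"], dtype=object
    ),
    "salary": [30000, 50000, 65000, 52000],
})

# Hand-written ordinal mapping; the order is an assumption about the levels.
education_order = {"High School": 1, "Bachelor": 2, "Master": 3}

# replace() swaps each label for its code; astype(int) makes the dtype explicit.
df["education_code"] = df["education"].replace(education_order).astype(int)
print(df)
```

Because the mapping is written by hand, any ordering can be imposed, including one based on frequency.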
Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. Instead, they need to be recoded into a series of variables which can then be entered into the regression model.
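One common way to do this recoding is dummy (indicator) coding. The sketch below uses pandas' get_dummies as an assumed choice of tool; the column name and values are illustrative only.

```python
import pandas as pd

# Illustrative data with three categories.
data = pd.DataFrame({"category": ["a", "b", "c", "b"]})

# One indicator column per category; drop_first=True drops the first level ("a"),
# which then acts as the reference category in a regression.
dummies = pd.get_dummies(
    data["category"], prefix="category", drop_first=True, dtype=int
)
print(dummies)
```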
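See argsort. Note that a single argsort returns the positions that would sort the column, not the rank of each row, so apply it twice to the negated frequencies to get ranks in descending order of frequency:
df['Order'] = (-df['Frequency']).argsort().argsort() + 1
df
returns
Category Frequency Order
0 a 5 2
1 b 10 1
2 c 2 3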
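As an alternative sketch, pandas' Series.rank computes the rank of each row directly, which may state the intent more plainly. The frame below is rebuilt from the frequencies in the question; the choice of method="first" for tie-breaking is an assumption.

```python
import pandas as pd

# Frequency table from the question: a=5, b=10, c=2.
df = pd.DataFrame({"Category": ["a", "b", "c"], "Frequency": [5, 10, 2]})

# ascending=False gives rank 1 to the largest frequency;
# method="first" breaks ties by row order so every rank is a distinct integer.
df["Order"] = df["Frequency"].rank(ascending=False, method="first").astype(int)
print(df)
```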
If you are using pandas, you can use its map() method:
import pandas as pd
data = pd.DataFrame([['a'], ['b'], ['c']], columns=['category'])
print(data)
category
0 a
1 b
2 c
mapping_dict = {'b':1, 'a':2, 'c':3}
print(data['category'].map(mapping_dict))
0 2
1 1
2 3
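The mapping does not have to be typed by hand. A minimal sketch, assuming the raw column holds the frequencies from the question, that builds mapping_dict from value_counts() and then applies map():

```python
import pandas as pd

# Raw column with the question's frequencies: a x5, b x10, c x2.
data = pd.DataFrame({"category": ["a"] * 5 + ["b"] * 10 + ["c"] * 2})

# value_counts() sorts categories by count, most frequent first.
ordered = data["category"].value_counts().index

# Number the categories 1, 2, 3, ... in that order.
mapping_dict = {cat: code for code, cat in enumerate(ordered, start=1)}

data["encoded"] = data["category"].map(mapping_dict)
print(mapping_dict)
```

If two categories have the same count, their relative order here depends on how value_counts() orders ties, so add an explicit tie-break if that matters.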
LabelEncoder uses np.unique to find the unique values present in a column. np.unique returns them in sorted order (alphabetical for strings), so you cannot impose a custom ordering with it.
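The effect can be seen by calling np.unique directly. This sketch only imitates the sorted-unique step described above; it is not LabelEncoder itself, and the sample values are made up.

```python
import numpy as np

# "b" is the most frequent value, but it does not get the first code.
values = np.array(["b", "a", "c", "b", "b", "a"])

# classes: sorted unique values; codes: position of each value within classes.
classes, codes = np.unique(values, return_inverse=True)
print(classes)
print(codes)
```

The codes follow alphabetical order (a=0, b=1, c=2) regardless of how often each value occurs.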