 

Converting categorical variables to numbers based on frequency in a single line

This is similar to LabelEncoder from scikit-learn, but with the requirement that the numbers be assigned in order of category frequency, i.e., the most frequent category gets the highest or lowest number, depending on the use case.

E.g., if the variable can take the values [a, b, c] with frequencies such as

  Category 
0        a 
0        a 
0        a 
0        a 
0        a 
1        b 
1        b 
1        b 
1        b 
1        b 
1        b 
1        b 
1        b 
1        b 
1        b 
2        c 
2        c 

Here a occurs 5 times, b occurs 10 times, and c occurs 2 times, so I want the replacements to be b=1, a=2, and c=3.

goelakash asked Sep 16 '18 17:09



2 Answers

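See argsort. Given a df with one row per category and its Frequency, a single argsort returns the positions that would sort the values rather than each value's rank, so apply it twice, negating the frequencies so that the most frequent category gets 1:

df['Order'] = (-df['Frequency']).argsort().argsort() + 1
df

returns

  Category  Frequency  Order
0        a          5      2
1        b         10      1
2        c          2      3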
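
The answer above starts from a table that already has a Frequency column. As a minimal sketch, assuming the raw data is a single column named Category as in the question, that table can be built with value_counts(). The example uses Series.rank instead of argsort to assign the order; the variable name raw and the tie-breaking choice method='first' are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Raw column from the question: a occurs 5 times, b 10 times, c 2 times
raw = pd.DataFrame({'Category': ['a'] * 5 + ['b'] * 10 + ['c'] * 2})

# One row per category with its count
counts = raw['Category'].value_counts()
df = pd.DataFrame({'Category': counts.index, 'Frequency': counts.to_numpy()})
df = df.sort_values('Category').reset_index(drop=True)

# Rank 1 goes to the most frequent category;
# method='first' gives tied categories distinct ranks in row order
df['Order'] = df['Frequency'].rank(ascending=False, method='first').astype(int)

print(df)
```

With the question's data this assigns b=1, a=2, and c=3. Passing ascending=True instead would give the least frequent category the value 1.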
Alex answered Oct 15 '22 04:10


If you are using pandas, you can use its map() method:

import pandas as pd
data = pd.DataFrame([['a'], ['b'], ['c']], columns=['category'])

print(data)

  category
0        a
1        b
2        c

mapping_dict = {'b':1, 'a':2, 'c':3}

print(data['category'].map(mapping_dict))

0    2
1    1
2    3

LabelEncoder uses np.unique to find the unique values in a column, and np.unique returns them in sorted (alphabetical) order, so you cannot impose a custom ordering with it.
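
The mapping does not have to be typed by hand. As a sketch, assuming the frequencies should come from the column itself, value_counts() and rank() can derive it, and map() then applies it in a single line. The column name code and the tie-breaking choice method='first' are assumptions for illustration.

```python
import pandas as pd

# Same frequencies as in the question: a x5, b x10, c x2
data = pd.DataFrame({'category': ['a'] * 5 + ['b'] * 10 + ['c'] * 2})

# value_counts() returns counts sorted in descending order;
# rank(ascending=False) turns them into 1 for the most frequent, 2 for the next, ...
mapping = (
    data['category']
    .value_counts()
    .rank(ascending=False, method='first')
    .astype(int)
)

# map() accepts a Series and looks each category up in its index
data['code'] = data['category'].map(mapping)

print(mapping.to_dict())
```

The two steps can be collapsed into one statement by passing the value_counts() expression directly to map(). When two categories have the same count, method='first' orders them by their position in the value_counts() result, so the assignment between tied categories is arbitrary.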

Vivek Kumar answered Oct 15 '22 04:10