 

Using OrdinalEncoder to transform categorical values

I have a dataset with many columns:

No  Name  Sex  Blood  Grade  Height  Study
1   Tom   M    O      56     160     Math
2   Harry M    A      76     192     Math
3   John  M    A      45     178     English
4   Nancy F    B      78     157     Biology
5   Mike  M    O      79     167     Math
6   Kate  F    AB     66     156     English
7   Mary  F    O      99     166     Science

I want to change it to something like this:

No  Name  Sex  Blood  Grade  Height  Study
1   Tom   0    0      56     160     0
2   Harry 0    1      76     192     0
3   John  0    1      45     178     1
4   Nancy 1    2      78     157     2
5   Mike  0    0      79     167     0
6   Kate  1    3      66     156     1
7   Mary  1    0      99     166     3

I know there is a library that can do that:

from sklearn.preprocessing import OrdinalEncoder

I tried this, but it did not work:

enc = OrdinalEncoder()
enc.fit(df[["Sex","Blood", "Study"]])

Can anyone help me find what I am doing wrong and how to do that?

Thanks

asked Jun 08 '19 by asmgx


People also ask

Can you transform a categorical variable?

- Categorical variable transformation means turning a categorical variable into a numeric one. It is required for most machine learning models, because they can only handle numeric values.

Does categorical data work with XGBoost?

Starting from version 1.5, XGBoost has experimental support for categorical data available for public testing.
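
A minimal sketch of what that looks like, assuming a recent XGBoost (1.6 or later) where tree_method="hist" accepts pandas "category" columns; the data below is made up purely for illustration:

import pandas as pd
import xgboost as xgb

# Made-up toy data; the categorical column must use pandas' "category" dtype.
X = pd.DataFrame({
    "Blood": pd.Categorical(["O", "A", "A", "B", "O", "AB", "O"]),
    "Height": [160, 192, 178, 157, 167, 156, 166],
})
y = [0, 0, 1, 1, 0, 1, 1]

# enable_categorical lets the booster split directly on the categories
# instead of requiring a manual numeric encoding first.
model = xgb.XGBClassifier(tree_method="hist", enable_categorical=True)
model.fit(X, y)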

How do you convert a categorical variable to a numerical variable?

Method 1: Dummy variable encoding, e.g. pandas' get_dummies function, converts the categorical string data into numeric indicator columns; see the sketch below.
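
For instance, a quick self-contained sketch (the blood-type values are just made up to mirror the question):

import pandas as pd

blood = pd.Series(["O", "A", "A", "B", "O", "AB", "O"], name="Blood")

# One 0/1 indicator ("dummy") column per category.
dummies = pd.get_dummies(blood, prefix="Blood")
print(dummies)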

Do we encode categorical variables for decision tree?

Decision tree models can handle categorical variables without one-hot encoding them. However, popular implementations of decision trees (and random forests) differ as to whether they support this, and one-hot encoding can seriously degrade tree-model performance.


2 Answers

You were almost there!

Basically, the fit method prepares the encoder (it fits on your data, i.e. builds the mapping) but does not transform the data.

You have to call transform to transform the data, or use fit_transform, which fits and transforms the same data.

enc = OrdinalEncoder()
enc.fit(df[["Sex","Blood", "Study"]])
df[["Sex","Blood", "Study"]] = enc.transform(df[["Sex","Blood", "Study"]])

or directly

enc = OrdinalEncoder()
df[["Sex","Blood", "Study"]] = enc.fit_transform(df[["Sex","Blood", "Study"]])

Note: the values won't be the ones you provided, since internally the fit method uses numpy.unique, which returns the categories sorted alphabetically rather than by order of appearance.

As you can see from enc.categories_

[array(['F', 'M'], dtype=object),
 array(['A', 'AB', 'B', 'O'], dtype=object),
 array(['Biology', 'English', 'Math', 'Science'], dtype=object)]

Each value in the array is encoded by its position (F will be encoded as 0, M as 1).
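
If you want the codes to follow the order shown in the question rather than alphabetical order, OrdinalEncoder accepts an explicit categories list. A sketch, assuming you start again from the original string columns:

from sklearn.preprocessing import OrdinalEncoder

# The order you list the categories in determines their codes.
enc = OrdinalEncoder(categories=[
    ["M", "F"],                                # Sex: M -> 0, F -> 1
    ["O", "A", "B", "AB"],                     # Blood: O -> 0, A -> 1, B -> 2, AB -> 3
    ["Math", "English", "Biology", "Science"], # Study: Math -> 0, English -> 1, ...
])
df[["Sex", "Blood", "Study"]] = enc.fit_transform(df[["Sex", "Blood", "Study"]])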

answered Nov 15 '22 by abcdaire


I think it is important to point out that this is not an example of ordinal encoding of variables. Sex, Blood and Study do not have an ordinal scale (and none was suggested by the person who asked the question). Ordinal data has a ranking (see e.g. https://en.wikipedia.org/wiki/Ordinal_data); the variables in this example do not.

If your variable is the target variable, you can use the LabelEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

Then you can do something like:

from sklearn.preprocessing import LabelEncoder

# Encode each column independently; note LabelEncoder is intended for target labels.
for col in ["Sex", "Blood", "Study"]:
    df[col] = LabelEncoder().fit_transform(df[col])

If your variables are features, you should use OrdinalEncoder instead (see the comments on my answer); a minimal sketch follows.
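
For example, a sketch (assuming the dataframe df from the question) that encodes the categorical feature columns with OrdinalEncoder inside a ColumnTransformer and passes the numeric columns through unchanged:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

ct = ColumnTransformer(
    [("ordinal", OrdinalEncoder(), ["Sex", "Blood", "Study"])],
    remainder="passthrough",  # keep Grade and Height as they are
)
X_encoded = ct.fit_transform(df[["Sex", "Blood", "Study", "Grade", "Height"]])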

The naming of OrdinalEncoder is somewhat unfortunate, as "ordinal" here is meant in a mathematical rather than a statistical sense.

More on the difference between OrdinalEncoder and LabelEncoder in sklearn: https://datascience.stackexchange.com/questions/39317/difference-between-ordinalencoder-and-labelencoder

answered Nov 15 '22 by Createdd