I have a dataset with many columns:
No Name Sex Blood Grade Height Study
1 Tom M O 56 160 Math
2 Harry M A 76 192 Math
3 John M A 45 178 English
4 Nancy F B 78 157 Biology
5 Mike M O 79 167 Math
6 Kate F AB 66 156 English
7 Mary F O 99 166 Science
I want to convert it to something like this:
No Name Sex Blood Grade Height Study
1 Tom 0 0 56 160 0
2 Harry 0 1 76 192 0
3 John 0 1 45 178 1
4 Nancy 1 2 78 157 2
5 Mike 0 0 79 167 0
6 Kate 1 3 66 156 1
7 Mary 1 0 99 166 3
I know there is a library that can do this:
from sklearn.preprocessing import OrdinalEncoder
I tried this, but it did not work:
enc = OrdinalEncoder()
enc.fit(df[["Sex","Blood", "Study"]])
Can anyone help me find what I am doing wrong and how to do this?
Thanks
Categorical variable transformation turns a categorical variable into a numeric one. Most machine learning models require it because they can handle only numeric values.
Starting from version 1.5, XGBoost has experimental support for categorical data available for public testing.
Method 1: Dummy variable encoding. Use the pandas get_dummies function to convert categorical string data into numeric indicator columns.
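As a rough illustration of dummy variable encoding, here is a minimal sketch that applies pandas get_dummies to a few rows of the data from the question. The DataFrame is rebuilt by hand here (an assumption, since the original loading code is not shown), and the resulting column names follow pandas' default `column_value` pattern.

```python
import pandas as pd

# A small hand-built copy of part of the question's data (assumed layout).
df = pd.DataFrame({
    "Name": ["Tom", "Harry", "Nancy", "Kate"],
    "Sex": ["M", "M", "F", "F"],
    "Blood": ["O", "A", "B", "AB"],
})

# Each listed column is replaced by one indicator column per distinct value.
dummies = pd.get_dummies(df, columns=["Sex", "Blood"])
print(dummies.columns.tolist())
```

Unlike an integer encoding, this produces one column per category, so the width of the table grows with the number of distinct values.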
Decision tree models can handle categorical variables without one-hot encoding them. However, popular implementations of decision trees (and random forests) differ as to whether they honor this fact. We show that one-hot encoding can seriously degrade tree-model performance.
You were almost there!
Basically, the fit method only prepares the encoder (it learns the mapping from your data) but does not transform the data.
You have to call transform to transform the data, or use fit_transform, which fits and transforms the same data in one step.
enc = OrdinalEncoder()
enc.fit(df[["Sex","Blood", "Study"]])
df[["Sex","Blood", "Study"]] = enc.transform(df[["Sex","Blood", "Study"]])
Or do both in one step:
enc = OrdinalEncoder()
df[["Sex","Blood", "Study"]] = enc.fit_transform(df[["Sex","Blood", "Study"]])
Note: The values won't be the ones that you provided, since internally the fit method uses numpy.unique, which returns the categories sorted in alphabetical order and not by order of appearance.
You can see this from enc.categories_:
[array(['F', 'M'], dtype=object),
array(['A', 'AB', 'B', 'O'], dtype=object),
array(['Biology', 'English', 'Math', 'Science'], dtype=object)]
Each value in an array is encoded by its position (F will be encoded as 0, M as 1).
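If you do want the codes to follow the order of first appearance, as in the table from the question, one option is to pass the category order explicitly through the `categories` parameter of OrdinalEncoder. The sketch below is only an illustration: the DataFrame is rebuilt by hand, and `cols`, `appearance_order` and `encoded` are names made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hand-built copy of the relevant columns from the question (assumed layout).
df = pd.DataFrame({
    "Name": ["Tom", "Harry", "John", "Nancy", "Mike", "Kate", "Mary"],
    "Sex": ["M", "M", "M", "F", "M", "F", "F"],
    "Blood": ["O", "A", "A", "B", "O", "AB", "O"],
    "Study": ["Math", "Math", "English", "Biology", "Math", "English", "Science"],
})
cols = ["Sex", "Blood", "Study"]

# One list of categories per column, in order of first appearance.
appearance_order = [list(pd.unique(df[c])) for c in cols]

# With explicit categories, each value is encoded by its position in its list.
enc = OrdinalEncoder(categories=appearance_order)
encoded = df.copy()
encoded[cols] = enc.fit_transform(df[cols]).astype(int)
print(encoded)
```

With this ordering M becomes 0 and F becomes 1, and the Blood and Study codes match the ones in the question.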
I think it is important to point out that this is not an example of ordinal encoding in the statistical sense. Sex, Blood and Study do not have an ordinal scale (and the person who asked the question did not suggest that they do). Ordinal data has a ranking (see e.g. https://en.wikipedia.org/wiki/Ordinal_data); the examples here do not.
If your variable is a target variable, you can use LabelEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
Then you can do something like:
from sklearn.preprocessing import LabelEncoder
for col in ["Sex", "Blood", "Study"]:
    df[col] = LabelEncoder().fit_transform(df[col])
If your variables are features, you should use OrdinalEncoder instead (see the comments on my answer).
The name OrdinalEncoder is rather unfortunate, as "ordinal" is meant in a mathematical rather than a statistical sense.
More on the difference between OrdinalEncoder and LabelEncoder in sklearn: https://datascience.stackexchange.com/questions/39317/difference-between-ordinalencoder-and-labelencoder
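To make the practical difference concrete, here is a small sketch (using the Study column from the question as stand-in data) showing that LabelEncoder works on a 1-D array of labels, while OrdinalEncoder expects a 2-D array of features. The variable names are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

study = np.array(["Math", "Math", "English", "Biology", "Math", "English", "Science"])

# LabelEncoder: 1-D input (a target column), 1-D integer output.
le = LabelEncoder()
y = le.fit_transform(study)

# OrdinalEncoder: 2-D input (samples x features), 2-D float output.
oe = OrdinalEncoder()
X = oe.fit_transform(study.reshape(-1, 1))

# Both keep the learned mapping, so the codes can be turned back into labels.
decoded = le.inverse_transform(y)
print(y.shape, X.shape, le.classes_)
```

Both encoders assign codes by sorted category order here, so the numbers agree; they differ mainly in the input shape they accept and in the role (target versus features) they are meant for.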