I am trying to build decision tree and random forest classifiers on the UCI bank marketing data (https://archive.ics.uci.edu/ml/datasets/bank+marketing). The data set has many categorical features with string values.
The Spark ML documentation mentions that categorical variables can be converted to numeric values by indexing with either StringIndexer or VectorIndexer. I chose StringIndexer (VectorIndexer requires a vector feature, and VectorAssembler, which converts features into a vector feature, accepts only numeric types). With this approach, each level of a categorical feature is assigned a numeric value based on its frequency (0 for the most frequent label of a categorical feature).
My question is how the Random Forest or Decision Tree algorithm will understand that the new features (derived from categorical features) are different from continuous variables. Will an indexed feature be considered continuous by the algorithm? Is this the right approach? Or should I go ahead with One-Hot Encoding for categorical features?
I read some of the answers on this forum, but I didn't get clarity on the last part.
One Hot Encoding should be done for categorical variables with categories > 2.
To understand why, you should know the difference between the two subcategories of categorical data: ordinal data and nominal data.
Ordinal data: The values have some sort of ordering between them. Example: customer feedback (excellent, good, neutral, bad, very bad). As you can see, there is a clear ordering between them (excellent > good > neutral > bad > very bad). In this case StringIndexer alone is sufficient for modelling purposes, provided the indices it assigns follow that order. By default it orders labels by frequency, so check the resulting mapping.
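An index assigned by label frequency, as the question describes for StringIndexer, does not necessarily follow the semantic order of an ordinal feature. The following is a minimal plain-Python sketch of the difference, not Spark code; the feedback sample and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter

# Semantic order of the feedback levels, worst to best.
ORDER = ["very bad", "bad", "neutral", "good", "excellent"]
ordinal_index = {label: float(i) for i, label in enumerate(ORDER)}

# Hypothetical sample in which "neutral" happens to be the most common answer.
feedback = (["neutral"] * 5 + ["good"] * 4 + ["very bad"] * 3
            + ["excellent"] * 2 + ["bad"])

# Frequency-based index: 0.0 for the most frequent label, then descending.
# Ties are broken alphabetically (an assumption of this sketch).
counts = Counter(feedback)
by_frequency = sorted(counts, key=lambda label: (-counts[label], label))
frequency_index = {label: float(i) for i, label in enumerate(by_frequency)}

print(ordinal_index)
print(frequency_index)
```

In this sample the frequency-based index puts "very bad" between "good" and "excellent", so the numeric order no longer matches the feedback scale. An explicit mapping such as `ordinal_index` keeps it.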
Nominal data: The values have no defined ordering between them. Example: colours (black, blue, white, ...). In this case StringIndexer alone is NOT sufficient, and One-Hot Encoding is required after string indexing.
After string indexing, let's assume the output is:
id | colour | categoryIndex
----|----------|---------------
0 | black | 0.0
1 | white | 1.0
2 | yellow | 2.0
3 | red | 3.0
Then, without One-Hot Encoding, a machine learning algorithm that treats the index as a number will assume red > yellow > white > black, which we know is not true. OneHotEncoder() helps us avoid this situation.
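To make the effect concrete, here is a small plain-Python sketch of frequency-based indexing followed by one-hot encoding on the colour example. It only mimics the idea and is not the Spark API: the helper names `string_index`, `one_hot` and `squared_distance`, the sample counts, and the alphabetical tie-break are all assumptions of the sketch. It also keeps one column per category, whereas Spark's OneHotEncoder drops the last category by default.

```python
from collections import Counter


def string_index(values):
    """Map each label to an index by descending frequency (0.0 = most frequent).

    Ties are broken alphabetically; that rule is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, size):
    """Return a vector of length `size` with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[int(index)] = 1.0
    return vec


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical sample chosen so the indices match the table above.
colours = ["black"] * 4 + ["white"] * 3 + ["yellow"] * 2 + ["red"]

mapping = string_index(colours)
encoded = {label: one_hot(idx, len(mapping)) for label, idx in mapping.items()}

print(mapping)
print(encoded["red"])
# As raw indices, black (0.0) is three times as far from red (3.0) as from
# white (1.0). After one-hot encoding every pair of colours is equally far apart.
print(squared_distance(encoded["black"], encoded["red"]))
print(squared_distance(encoded["black"], encoded["white"]))
```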
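So, to answer your questions:
Will an indexed feature be considered continuous by the algorithm?
By most algorithms, yes: the index is treated as a continuous variable. Spark ML's tree-based algorithms are an exception. They read the feature metadata that StringIndexer attaches to its output column and treat the feature as categorical when that metadata is present.
Is it the right approach? Or should I go ahead with One-Hot Encoding for categorical features?
It depends on your understanding of the data. Although Random Forest and some boosting methods don't require One-Hot Encoding, most ML algorithms need it.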
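For tree models, the practical difference between a continuous index and a categorical one lies in which splits are available. The plain-Python sketch below only counts candidate splits for the four colour indices; the function names are hypothetical, and this is not how any particular library searches for splits.

```python
from itertools import combinations


def threshold_splits(indices):
    """Splits available when the index is treated as a continuous value:
    everything up to a threshold goes left, the rest goes right."""
    ordered = sorted(indices)
    return [(frozenset(ordered[:i]), frozenset(ordered[i:]))
            for i in range(1, len(ordered))]


def subset_splits(indices):
    """Splits available when the index is treated as an unordered category:
    any non-empty proper subset may go left. Mirror images are counted once
    by always keeping the smallest index on the left."""
    ordered = sorted(indices)
    first, rest = ordered[0], ordered[1:]
    splits = []
    # k = how many other indices join `first` on the left side;
    # k < len(rest) guarantees the right side is never empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = frozenset((first,) + extra)
            splits.append((left, frozenset(ordered) - left))
    return splits


colour_indices = [0, 1, 2, 3]  # black, white, yellow, red
print(len(threshold_splits(colour_indices)))  # 3 threshold splits
print(len(subset_splits(colour_indices)))     # 7 subset splits
```

A split such as {black, red} versus {white, yellow} exists only in the second list. A single threshold on the index can never produce it, although a tree can approximate it with several successive threshold splits.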
Refer: https://spark.apache.org/docs/latest/ml-features.html#onehotencoder