In the MLlib version of Random Forest it was possible to specify the columns with nominal features (numerical but still categorical variables) with the categoricalFeaturesInfo parameter.
What about the ML Random Forest? In the user guide there is an example that uses VectorIndexer to convert the categorical features into a vector as well, but it says "Automatically identify categorical features, and index them".
In another discussion of the same problem I found that numerical indexes are treated as continuous features in random forest anyway, and that one-hot encoding is recommended to avoid this. That seems to make no sense for this algorithm, especially given the official example mentioned above!
I also noticed that when there are many categories (>1000) in the categorical column, once they are indexed with StringIndexer, the random forest algorithm asks me to set the maxBins parameter, which is supposed to be used with continuous features. Does this mean that features with more categories than the number of bins will be treated as continuous, as specified in the official example, so that StringIndexer is OK for my categorical column? Or does it mean that the whole column of numerical-but-still-nominal features will be bucketized under the assumption that the variables are continuous?
One of the most important features of the Random Forest algorithm is that it can handle data sets containing both continuous variables (as in regression) and categorical variables (as in classification).
You can cast a string column in a Spark DataFrame to a numerical data type using the cast function: read in a CSV file as a DataFrame, cast the default string types into integer and double, and overwrite the original DataFrame.
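A minimal sketch of that casting step. The file name and column names here are assumptions for illustration, not from the original post:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, IntegerType}

val spark = SparkSession.builder().appName("cast-example").getOrCreate()

// CSV columns are read as strings by default when no schema is given.
var df = spark.read.option("header", "true").csv("data.csv")

// Cast the string columns to numeric types and overwrite the frame.
// "age" and "salary" are hypothetical column names.
df = df
  .withColumn("age", df("age").cast(IntegerType))
  .withColumn("salary", df("salary").cast(DoubleType))
```

Note that cast produces a new column; withColumn with the same name replaces the original string column in place.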
If the feature is categorical, the split is done with the elements belonging to a particular class. If the feature is continuous, the split is done with the elements higher than a threshold. At every split, the decision tree takes the best variable at that moment.
A random forest is an ensemble learning algorithm built from decision trees (ensemble learning algorithms combine multiple machine learning models to obtain a better model). In this post I'm going to use Random Forest to build a classification model with Apache Spark (if you are new to Apache Spark, you can find more information here).
Following is the structure/schema of a single record. To build a Random Forest classifier model from this data set, we first need to load it into a Spark DataFrame. Following is the way to do that: it loads the data into a DataFrame from a .csv file based on the schema.
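The loading step can be sketched as follows. The original post's schema is not shown here, so the field names and types below are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("load-example").getOrCreate()

// Hypothetical schema standing in for the single-record structure.
val schema = StructType(Seq(
  StructField("label", DoubleType, nullable = false),
  StructField("feature1", DoubleType, nullable = true),
  StructField("category", StringType, nullable = true)
))

// Load the CSV into a DataFrame using the explicit schema,
// so columns get the declared types instead of defaulting to string.
val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("records.csv")
```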
A class that implements a Random Forest learning algorithm for classification and regression. It supports both continuous and categorical features.
Random forest comes with many parameters that we can tune. Tuning them manually is a lot of work, so we can use the cross-validation facility provided by Spark ML to search through this parameter space and come up with the best parameters for our data. To do this with Spark ML, we first need to define the parameters we want to tune.
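A sketch of that tuning setup, assuming a classifier with "label" and "features" columns; the grid values here are arbitrary examples:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Define the parameter space to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(20, 50, 100))
  .addGrid(rf.maxDepth, Array(5, 10))
  .build()

// Cross-validation tries every combination in the grid and
// keeps the one with the best evaluator metric.
val cv = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(new MulticlassClassificationEvaluator().setLabelCol("label"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// val cvModel = cv.fit(trainingData)  // fits and selects the best model
```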
"In the other discussion of the same problem I found that numerical indexes are treated as continuous features anyway in random forest"
This is actually incorrect. Tree models (including RandomForest) depend on column metadata to distinguish between categorical and numerical variables. Metadata can be provided by ML transformers (like StringIndexer or VectorIndexer) or added manually. The old mllib RDD-based API, which is used internally by ml models, uses a categoricalFeaturesInfo Map for the same purpose. The current API simply takes the metadata and converts it to the format expected by categoricalFeaturesInfo.

OneHotEncoding is required only for linear models, and is recommended, although not required, for the multinomial naive Bayes classifier.
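The metadata mechanism described above can be sketched as a pipeline. The column names and the maxBins value are assumptions for illustration; the key point is that StringIndexer attaches nominal-attribute metadata to its output column, so the tree learner treats that feature as categorical rather than continuous:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// StringIndexer writes nominal metadata into "categoryIndex";
// the tree model reads that metadata, not just the numeric values.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// VectorAssembler preserves the per-column metadata in the feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryIndex", "numericFeature"))
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  // maxBins must be at least the number of categories in any
  // categorical feature, hence the error with >1000 categories.
  .setMaxBins(2000)

val pipeline = new Pipeline().setStages(Array(indexer, assembler, rf))
// val model = pipeline.fit(trainingData)
```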