 

How to handle categorical features in the latest Random Forest in Spark?

In the MLlib version of Random Forest it was possible to specify the columns with nominal features (numerical but still categorical variables) with the parameter categoricalFeaturesInfo. What about the ML Random Forest? In the user guide there is an example that uses VectorIndexer, which converts the categorical features into a vector as well, but it says "Automatically identify categorical features, and index them".

In another discussion of the same problem I found that numerical indexes are treated as continuous features in random forest anyway, and that one-hot encoding is recommended to avoid this. That seems to make no sense in the case of this algorithm, especially given the official example mentioned above!

I also noticed that when the categorical column has a lot of categories (>1000), once they are indexed with StringIndexer, the random forest algorithm asks me to set the maxBins parameter, which is supposed to be used with continuous features. Does this mean that features with more categories than the number of bins will be treated as continuous, as specified in the official example, so StringIndexer is OK for my categorical column? Or does it mean that the whole column of numerical but still nominal features will be bucketized under the assumption that the variables are continuous?

asked Oct 15 '17 20:10 by Andrew_457


People also ask

Can Random Forests handle categorical variables?

One of the most important features of the Random Forest algorithm is that it can handle data sets containing continuous variables, as in the case of regression, as well as categorical variables, as in the case of classification.

How does Pyspark handle categorical data?

You can cast a string column in a Spark data frame to a numerical data type using the cast function. In the example above, we read a CSV file into a data frame, cast the default string data types into integer and double, and overwrite the original data frame.

How do you handle categorical features in a decision tree?

If the feature is categorical, the split is done with the elements belonging to a particular class. If the feature is continuous, the split is done with the elements higher than a threshold. At every split, the decision tree will take the best variable at that moment.

What is a random forest in spark?

A random forest is actually an ensemble learning algorithm of decision trees (ensemble learning algorithms combine multiple machine learning models to obtain a better model). In this post I'm going to use Random Forest to build a classification model with Apache Spark. (If you are new to Apache Spark, please find more information here.)

How to build random forest classifier model from single record in spark?

Following is the structure/schema of a single record. To build a Random Forest classifier model from this data set, we first need to load the data set into a Spark DataFrame. Following is the way to do that. It loads the data into a DataFrame from a .csv file based on the schema.

What is random forest a class?

A class that implements a Random Forest learning algorithm for classification and regression. It supports both continuous and categorical features.

How to tune random forest parameters using Spark ML?

Random forest comes with many parameters which we can tune. Tuning them manually is a lot of work, so we can use the cross-validation facility provided by Spark ML to search through this parameter space and come up with the best parameters for our data. To do this with Spark ML, we first need to define the parameters that we need to tune.


1 Answer

In the other discussion of the same problem I found that numerical indexes are treated as continuous features anyway in random forest,

This is actually incorrect. Tree models (including RandomForest) depend on column metadata to distinguish between categorical and numerical variables. Metadata can be provided by ML transformers (like StringIndexer or VectorIndexer) or added manually. The old mllib RDD-based API, which is used internally by ml models, uses a categoricalFeaturesInfo Map for the same purpose.

The current API just takes the metadata and converts it to the format expected by categoricalFeaturesInfo.

One-hot encoding is required only for linear models, and is recommended, though not required, for the multinomial naive Bayes classifier.

answered Sep 20 '22 16:09 by zero323