In the MLlib version of Random Forest it was possible to specify the columns with nominal features (numerical but still categorical variables) with the categoricalFeaturesInfo parameter.
What about the ML Random Forest? In the user guide there is an example that uses VectorIndexer to convert the categorical features into a vector as well, but it says "Automatically identify categorical features, and index them".
In another discussion of the same problem I found that numerical indexes are treated as continuous features in random forest anyway, and that one-hot encoding is recommended to avoid this. That seems to make no sense for this algorithm, especially given the official example mentioned above!
I also noticed that when there are many categories (>1000) in the categorical column, once they are indexed with StringIndexer, the random forest algorithm asks me to set the maxBins parameter, which is supposed to be used with continuous features. Does this mean that features with more categories than the number of bins will be treated as continuous, as specified in the official example, so that StringIndexer is OK for my categorical column? Or does it mean that the whole column of numerical-but-still-nominal features will be bucketized under the assumption that the variables are continuous?
One of the most important features of the Random Forest algorithm is that it can handle data sets containing both continuous variables (as in regression) and categorical variables (as in classification).
You can cast a string column in a Spark DataFrame to a numerical data type using the cast function: read in a CSV file as a DataFrame, cast the default string types into integer and double, and overwrite the original DataFrame.
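A minimal sketch of that casting step. The file name and column names here are assumptions for illustration, not from the original post:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, IntegerType}

val spark = SparkSession.builder().appName("cast-example").getOrCreate()

// CSV columns are read as strings by default when no schema is given.
var df = spark.read.option("header", "true").csv("data.csv")

// Cast the string columns to numeric types and overwrite the frame.
// "age" and "salary" are hypothetical column names.
df = df
  .withColumn("age", df("age").cast(IntegerType))
  .withColumn("salary", df("salary").cast(DoubleType))
```

Note that cast produces a new column; withColumn with the same name replaces the original string column in place.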
If the feature is categorical, the split is done with the elements belonging to a particular class. If the feature is continuous, the split is done with the elements higher than a threshold. At every split, the decision tree takes the best variable at that moment.
A random forest is an ensemble learning algorithm built from decision trees (ensemble learning algorithms combine multiple machine learning models to obtain a better model). In this post I'm going to use Random Forest to build a classification model with Apache Spark (if you are new to Apache Spark, you can find more information here).
Following is the structure/schema of a single record. To build a Random Forest classifier model from this data set, we first need to load it into a Spark DataFrame. Following is the way to do that: it loads the data into a DataFrame from a .csv file based on the schema.
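The loading step can be sketched as follows. The original post's schema is not shown here, so the field names and types below are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("load-example").getOrCreate()

// Hypothetical schema standing in for the single-record structure.
val schema = StructType(Seq(
  StructField("label", DoubleType, nullable = false),
  StructField("feature1", DoubleType, nullable = true),
  StructField("category", StringType, nullable = true)
))

// Load the CSV into a DataFrame using the explicit schema,
// so columns get the declared types instead of defaulting to string.
val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("records.csv")
```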
A class that implements a Random Forest learning algorithm for classification and regression. It supports both continuous and categorical features.
Random forest comes with many parameters that we can tune. Tuning them manually is a lot of work, so we can use the cross-validation facility provided by Spark ML to search through this parameter space and come up with the best parameters for our data. To do this with Spark ML, we first need to define the parameters we want to tune.
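A sketch of that tuning setup, assuming a classifier with "label" and "features" columns; the grid values here are arbitrary examples:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Define the parameter space to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(20, 50, 100))
  .addGrid(rf.maxDepth, Array(5, 10))
  .build()

// Cross-validation tries every combination in the grid and
// keeps the one with the best evaluator metric.
val cv = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(new MulticlassClassificationEvaluator().setLabelCol("label"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// val cvModel = cv.fit(trainingData)  // fits and selects the best model
```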
"In the other discussion of the same problem I found that numerical indexes are treated as continuous features anyway in random forest"
This is actually incorrect. Tree models (including RandomForest) depend on column metadata to distinguish between categorical and numerical variables. Metadata can be provided by ML transformers (like StringIndexer or VectorIndexer) or added manually. The old mllib RDD-based API, which is used internally by ml models, uses a categoricalFeaturesInfo Map for the same purpose. The current API simply takes the metadata and converts it to the format expected by categoricalFeaturesInfo.

OneHotEncoding is required only for linear models, and is recommended, although not required, for the multinomial naive Bayes classifier.
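The metadata mechanism described above can be sketched as a pipeline. The column names and the maxBins value are assumptions for illustration; the key point is that StringIndexer attaches nominal-attribute metadata to its output column, so the tree learner treats that feature as categorical rather than continuous:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// StringIndexer writes nominal metadata into "categoryIndex";
// the tree model reads that metadata, not just the numeric values.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// VectorAssembler preserves the per-column metadata in the feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryIndex", "numericFeature"))
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  // maxBins must be at least the number of categories in any
  // categorical feature, hence the error with >1000 categories.
  .setMaxBins(2000)

val pipeline = new Pipeline().setStages(Array(indexer, assembler, rf))
// val model = pipeline.fit(trainingData)
```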