PySpark & MLLib: Random Forest Feature Importances

I'm trying to extract the feature importances of a random forest object I have trained using PySpark. However, I do not see an example of doing this anywhere in the documentation, nor is it a method of RandomForestModel.

How can I extract feature importances from a RandomForestModel regressor or classifier in PySpark?

Here's the sample code provided in the documentation to get us started; however, there is no mention of feature importances in it.

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

I don't see a model.featureImportances attribute available -- where can I find it?
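
For what it's worth, here's a quick check (a minimal sketch against the model trained above) showing that nothing importance-related is exposed on the mllib model:

# Inspect the trained mllib model for anything importance-related.
print(hasattr(model, "featureImportances"))          # False
print([m for m in dir(model) if "mportance" in m])   # []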

asked Mar 10 '15 by Bryan


2 Answers

UPDATE for versions > 2.0.0

As of version 2.0.0, as you can see here, featureImportances is available for Random Forest.

In fact, you can read here that:

The DataFrame API supports two major tree ensemble algorithms: Random Forests and Gradient-Boosted Trees (GBTs). Both use spark.ml decision trees as their base models.

Users can find more information about ensemble algorithms in the MLlib Ensemble guide. In this section, we demonstrate the DataFrame API for ensembles.

The main differences between this API and the original MLlib ensembles API are:

  • support for DataFrames and ML Pipelines
  • separation of classification vs. regression
  • use of DataFrame metadata to distinguish continuous and categorical features
  • more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional probabilities) for classification.

If you want feature importance values, you have to work with the ml package, not mllib, and use DataFrames.

Below is an example that you can find here:

# IMPORT
import numpy
from numpy import allclose
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import RandomForestClassifier

# PREPARE DATA (assumes an existing SparkSession named `spark`)
df = spark.createDataFrame([
    (1.0, Vectors.dense(1.0)),
    (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(df)
td = si_model.transform(df)

# BUILD THE MODEL
rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
model = rf.fit(td)

# FEATURE IMPORTANCES
model.featureImportances
# SparseVector(1, {0: 1.0})
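
The result is a SparseVector with one entry per feature. As a usage note, here is a minimal sketch for turning it into a sorted, readable list (the feature_names list is hypothetical; supply the names of the columns that went into your own feature vector):

# featureImportances is a SparseVector of length numFeatures;
# toArray() converts it to a dense numpy array.
importances = model.featureImportances.toArray()

# Hypothetical feature names -- replace with the names of the columns
# that were assembled into your "features" vector.
feature_names = ["feature_0"]  # only one feature in this toy example

for name, score in sorted(zip(feature_names, importances), key=lambda x: -x[1]):
    print(name, score)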
answered Sep 22 '22 by titiro89


I have to disappoint you, but feature importances are simply not calculated in the MLlib implementation of RandomForest, so you cannot get them from anywhere except by implementing the calculation yourself.

Here's how to verify this:

You call the function RandomForest.trainClassifier, defined here: https://github.com/apache/spark/blob/branch-1.3/python/pyspark/mllib/tree.py

It calls callMLlibFunc("trainRandomForestModel", ...), which invokes the Scala function RandomForest.trainClassifier or RandomForest.trainRegressor (depending on the algorithm) and returns a RandomForestModel object.

This object is described in https://github.com/apache/spark/blob/branch-1.3/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala and extends TreeEnsembleModel, defined in the same source file. Unfortunately, this class stores only the algorithm (regression or classification), the trees themselves, the relative weights of the trees, and the combining strategy (sum, avg, vote). It does not store feature importances, and it does not even calculate them (see https://github.com/apache/spark/blob/branch-1.3/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala for the training algorithm).
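
If you are stuck on the old mllib API, a rough do-it-yourself workaround (a sketch only, and not the impurity-weighted importance that spark.ml computes) is to count how often each feature index appears in the split conditions printed by model.toDebugString():

import re
from collections import Counter

# Crude split-count proxy for feature importance with mllib's RandomForestModel.
# Lines in the debug output look like "If (feature 434 <= 0.0)".
def split_counts(model):
    indices = re.findall(r"feature (\d+)", model.toDebugString())
    return Counter(int(i) for i in indices)

print(split_counts(model).most_common(10))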

answered Sep 19 '22 by 0x0FFF