
How do I use Spark's Feature Importance on Random Forest?

The documentation for Random Forests does not include feature importances. However, the feature is listed as resolved in JIRA and is present in the source code. HERE also says: "The main differences between this API and the original MLlib ensembles API are:

  • support for DataFrames and ML Pipelines
  • separation of classification vs. regression
  • use of DataFrame metadata to distinguish continuous and categorical features
  • more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional probabilities) for classification."

However, I cannot figure out the syntax for calling this new feature.

scala> model
res13: org.apache.spark.mllib.tree.model.RandomForestModel = 
TreeEnsembleModel classifier with 10 trees

scala> model.featureImportances
<console>:60: error: value featureImportances is not a member of org.apache.spark.mllib.tree.model.RandomForestModel
              model.featureImportances
asked Jan 05 '16 by Climbs_lika_Spyder

1 Answer

You have to use the new Random Forests from the DataFrame-based spark.ml package. Check your imports. The OLD (RDD-based spark.mllib API):

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel

The NEW Random Forests use:

import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.RandomForestClassifier
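For reference, here is a minimal sketch of how the new API is typically used. It assumes a DataFrame trainingData with a "label" column and an assembled "features" vector column; those names are illustrative, not from the question:

import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}

val rf = new RandomForestClassifier()
  .setLabelCol("label")        // hypothetical column names, set to match your DataFrame
  .setFeaturesCol("features")
  .setNumTrees(10)

val model: RandomForestClassificationModel = rf.fit(trainingData)

// featureImportances exists on the spark.ml model, not on the old mllib RandomForestModel
println(model.featureImportances)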

This S.O. answer provides code for extracting the importances.

This S.O. answer explains the sparse vector that is returned.
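As a rough sketch of reading that vector back out, assuming model is the RandomForestClassificationModel from above and featureNames is a hypothetical Array[String] listing the columns in the same order the feature vector was assembled:

// Densify the (possibly sparse) importance vector and pair it with feature names
val importances = model.featureImportances.toArray

featureNames.zip(importances)
  .sortBy { case (_, imp) => -imp }                 // most important features first
  .foreach { case (name, imp) => println(f"$name%-20s $imp%.4f") }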

answered Oct 13 '22 by Climbs_lika_Spyder