
pyspark randomForest feature importance: how to get column names from the column numbers

I am using the standard (string indexer + one-hot encoder + random forest) pipeline in Spark, as shown below:

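from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator

labelIndexer = StringIndexer(inputCol = class_label_name, outputCol="indexedLabel").fit(data)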

string_feature_indexers = [
   StringIndexer(inputCol=x, outputCol="int_{0}".format(x)).fit(data)
   for x in char_col_toUse_names
]

onehot_encoder = [
   OneHotEncoder(inputCol="int_"+x, outputCol="onehot_{0}".format(x))
   for x in char_col_toUse_names
]
all_columns = num_col_toUse_names + bool_col_toUse_names + ["onehot_"+x for x in char_col_toUse_names]
assembler = VectorAssembler(inputCols=[col for col in all_columns], outputCol="features")
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features", numTrees=100)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)
pipeline = Pipeline(stages=[labelIndexer] + string_feature_indexers + onehot_encoder + [assembler, rf, labelConverter])

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
cvModel = crossval.fit(trainingData)

Now, after the fit, I can get the random forest and its feature importances using cvModel.bestModel.stages[-2].featureImportances, but this does not give me feature/column names, only the feature numbers.

What I get is below:

print(cvModel.bestModel.stages[-2].featureImportances)

(1446,[3,4,9,18,20,103,766,981,983,1098,1121,1134,1148,1227,1288,1345,1436,1444],[0.109898803421,0.0967396441648,4.24568235244e-05,0.0369705839109,0.0163489685127,3.2286694534e-06,0.0208192703688,0.0815822887175,0.0466903663708,0.0227619959989,0.0850922269211,0.000113388896956,0.0924779490403,0.163835022713,0.118987129392,0.107373548367,3.35577640585e-05,0.000229569946193])

How can I map it back to column names, or to a column name + value format?
Basically, I want the random forest feature importances along with the column names.

Abhishek asked Jul 11 '17 02:07


2 Answers

The transformed dataset's metadata has the required attributes. Here is an easy way to get them:

  1. Create a pandas DataFrame (the feature list is generally not huge, so holding it in a pandas DataFrame causes no memory issues):

    import pandas as pd

    attrs = dataset.schema["features"].metadata["ml_attr"]["attrs"]
    pandasDF = pd.DataFrame(attrs["binary"] + attrs["numeric"]).sort_values("idx")
    
  2. Then create a dictionary for the mapping. Broadcasting it is only needed if the lookup runs on the executors (for example inside a UDF); on the driver the plain dictionary is enough.

    feature_dict = dict(zip(pandasDF["idx"],pandasDF["name"])) 
    
    feature_dict_broad = sc.broadcast(feature_dict)
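
Putting the two steps together, here is a minimal sketch of the lookup itself. It assumes the "ml_attr" metadata has the shape used above (a dict of attribute groups such as "binary" and "numeric", each a list of entries with "idx" and "name") and that the importances are available as parallel index and value lists, as in the printed sparse vector. The helper name importances_by_name and the sample metadata are made up for illustration; plain Python structures stand in for the Spark objects so the snippet runs without a cluster.

```python
def importances_by_name(attrs, indices, values):
    """Map feature-vector slot indices to attribute names.

    attrs   -- stand-in for
               dataset.schema["features"].metadata["ml_attr"]["attrs"]
    indices -- slot numbers of the non-zero importances
    values  -- importances, parallel to indices
    """
    # Flatten every attribute group ("binary", "numeric", ...) into idx -> name.
    idx_to_name = {
        attr["idx"]: attr["name"]
        for group in attrs.values()
        for attr in group
    }
    pairs = [(idx_to_name[i], v) for i, v in zip(indices, values)]
    # Most important features first.
    return sorted(pairs, key=lambda p: p[1], reverse=True)


# Made-up metadata in the shape the assembled "features" column is assumed to carry.
sample_attrs = {
    "numeric": [{"idx": 0, "name": "age"}, {"idx": 1, "name": "income"}],
    "binary": [
        {"idx": 2, "name": "onehot_city_london"},
        {"idx": 3, "name": "onehot_city_paris"},
    ],
}

ranked = importances_by_name(sample_attrs, [0, 3], [0.25, 0.75])
print(ranked)
```

With the real objects, indices and values would presumably come from the indices and values attributes of featureImportances when it is a SparseVector, and attrs from the schema of the transformed dataset.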
    

You can also look here and here

aamirr answered Oct 19 '22 15:10


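Hey, why don't you just map it back to the original columns with a list comprehension? Note that this only works when every column fills exactly one slot of the feature vector; after one-hot encoding the indices refer to vector slots, not DataFrame columns. Here is an example: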

# in your case: trainingData.columns
data_frame_columns = ["A", "B", "C", "D", "E", "F"]
# stand-in for cvModel.bestModel.stages[-2].featureImportances: (size, indices, values)
feature_importance = (6, [1, 3, 5], [0.5, 0.5, 0.5])

# pair each index with its importance, then look up the column name
rf_output = [(data_frame_columns[i], v) for i, v in zip(feature_importance[1], feature_importance[2])]
dict(rf_output)

{'B': 0.5, 'D': 0.5, 'F': 0.5}
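
Where one-hot encoding spreads a column over several slots, the slot importances can be summed back per source column. Below is a minimal sketch of that idea, assuming each slot is named either exactly like its assembler input column or as that column name followed by "_" and the category. The naming rule, the helper name importance_per_column and the sample data are assumptions for illustration, and the prefix match is a heuristic that can misattribute slots when column names overlap.

```python
from collections import defaultdict


def importance_per_column(input_cols, named_importances):
    """Sum slot-level importances back onto the assembler's input columns.

    input_cols        -- the VectorAssembler inputCols (e.g. all_columns)
    named_importances -- (slot_name, importance) pairs
    Assumes a slot is named either exactly like its input column or
    "<input column>_<category>"; this naming is an assumption of the sketch.
    """
    totals = defaultdict(float)
    # Try longer column names first so "onehot_a_b" wins over "onehot_a".
    candidates = sorted(input_cols, key=len, reverse=True)
    for slot_name, importance in named_importances:
        for col in candidates:
            if slot_name == col or slot_name.startswith(col + "_"):
                totals[col] += importance
                break
        else:
            # No input column matched: keep the slot under its own name.
            totals[slot_name] += importance
    return dict(totals)


# Made-up example data.
columns = ["age", "onehot_city"]
slots = [("age", 0.25), ("onehot_city_london", 0.5), ("onehot_city_paris", 0.25)]
per_column = importance_per_column(columns, slots)
print(per_column)
```

Summing is only one possible way to aggregate; it keeps the totals comparable with the importances of the plain numeric columns.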
Dat Tran answered Oct 19 '22 14:10