 

Random Forest Feature Importance Chart using Python

I am working with RandomForestRegressor in Python and I want to create a chart that illustrates the ranking of feature importances. This is the code I used:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

MT = pd.read_csv("MT_reduced.csv")
df = MT.reset_index(drop=False)

columns2 = df.columns.tolist()

# Filter the columns to remove ones we don't want.
columns2 = [c for c in columns2 if c not in ["Violent_crime_rate", "Change_Property_crime_rate", "State", "Year"]]

# Store the variable we'll be predicting on.
target = "Property_crime_rate"

# Randomly split the data with 80% as the train set and 20% as the test set.
# Generate the training set. Set random_state to be able to replicate results.
train2 = df.sample(frac=0.8, random_state=1)

# Exclude all observations with a matching index.
test2 = df.loc[~df.index.isin(train2.index)]

# Both sets need the same features; only the number of observations should differ.
print(train2.shape)
print(test2.shape)

# Initialize the model with some parameters.
# n_estimators = number of trees in the forest
# min_samples_leaf = minimum number of samples at each leaf
model = RandomForestRegressor(n_estimators=100, min_samples_leaf=8, random_state=1)

# Fit the model to the data.
model.fit(train2[columns2], train2[target])

# Make predictions.
predictions_rf = model.predict(test2[columns2])

# Compute the error.
mean_squared_error(predictions_rf, test2[target])  # 650.4928

Feature Importance

import numpy as np
import matplotlib.pyplot as plt

features = df.columns[[3, 4, 6, 8, 9, 10]]
importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()

This feature-importance code was adapted from an example found at http://www.agcross.com/2015/02/random-forests-in-python-with-scikit-learn/

I receive the following error when I attempt to replicate the code with my data:

  IndexError: index 6 is out of bounds for axis 1 with size 6 

Also, only one feature shows up on my chart, with 100% importance, and there are no labels.

Any help solving this issue so I can create this chart will be greatly appreciated.

asked May 21 '17 by user348547

People also ask

How do you check feature importance in a random forest?

We can measure how much each feature decreases the impurity of a split (the feature with the highest decrease is selected for each internal node). For each feature we can collect how much, on average, it decreases the impurity. The average over all trees in the forest is the measure of that feature's importance.
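For illustration, that averaging can be reproduced by hand in scikit-learn: each fitted tree exposes its own impurity-based importances, and the forest's feature_importances_ is their mean across estimators. A minimal sketch, assuming a fitted forest called model as in the question:

import numpy as np

# Collect the per-tree impurity-based importances from every estimator in the forest.
per_tree = np.array([tree.feature_importances_ for tree in model.estimators_])

# Average across trees; this should match the forest's own attribute up to float precision.
manual_importances = per_tree.mean(axis=0)
print(np.allclose(manual_importances, model.feature_importances_))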

What is importance in random forest?

A random forest adds additional randomness to the model while growing the trees. Instead of searching for the most important feature when splitting a node, it searches for the best feature among a random subset of features. This results in wider diversity, which generally yields a better model.
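In scikit-learn, the size of that random subset is controlled by the max_features parameter. A short sketch, assuming the same kind of regressor as in the question (the max_features value here is purely illustrative):

from sklearn.ensemble import RandomForestRegressor

# max_features limits how many candidate features each split may consider;
# smaller values mean more randomness between trees.
model = RandomForestRegressor(
    n_estimators=100,
    max_features="sqrt",   # consider about sqrt(n_features) candidates per split
    min_samples_leaf=8,
    random_state=1,
)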


2 Answers

Here is an example using the iris data set.

>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> iris = load_iris()
>>> rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
>>> rnd_clf.fit(iris["data"], iris["target"])
>>> for name, importance in zip(iris["feature_names"], rnd_clf.feature_importances_):
...     print(name, "=", importance)
sepal length (cm) = 0.112492250999
sepal width (cm) = 0.0231192882825
petal length (cm) = 0.441030464364
petal width (cm) = 0.423357996355

Plotting feature importance

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> features = iris['feature_names']
>>> importances = rnd_clf.feature_importances_
>>> indices = np.argsort(importances)
>>> plt.title('Feature Importances')
>>> plt.barh(range(len(indices)), importances[indices], color='b', align='center')
>>> plt.yticks(range(len(indices)), [features[i] for i in indices])
>>> plt.xlabel('Relative Importance')
>>> plt.show()

Feature importances

answered by spies006

Load the feature importances into a pandas Series indexed by your column names, then use its plot method, e.g. for an sklearn RF classifier/regressor model trained on df:

feat_importances = pd.Series(model.feature_importances_, index=df.columns)
feat_importances.nlargest(4).plot(kind='barh')
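Applied to the question's setup, the key point is to index the Series with the same columns the model was actually fit on (columns2, not all of df.columns), so each importance lines up with the right feature name. A minimal sketch under that assumption:

import pandas as pd
import matplotlib.pyplot as plt

# columns2 is the list of feature columns passed to model.fit() in the question,
# so the Series index matches model.feature_importances_ one-to-one.
feat_importances = pd.Series(model.feature_importances_, index=columns2)
feat_importances.sort_values().plot(kind='barh')
plt.xlabel('Relative Importance')
plt.show()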


answered by fordy