 

Interpreting feature importance values from a RandomForestClassifier

I am a beginner when it comes to machine learning, and I'm having trouble interpreting some of the results I'm getting from my first program. Here's the setup:

I have a dataset of book reviews. These books can be tagged with any number of qualifiers from a set of about 1600. The people reviewing these books can also tag themselves with these qualifiers to indicate that they like to read things with that tag.

The dataset has a column for each qualifier. For every review, if a given qualifier is used to tag both the book and the reviewer, a value of 1 is recorded; if there is no "match" for a given qualifier on a given review, a value of 0 is recorded.

There is also a "Score" column, which holds an integer 1-5 for each review (the "star rating" of that review). My goal is to determine what features are most important to getting a high score.
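The match-matrix construction described above can be sketched like this. This is only an illustration with hypothetical books, reviewers, and qualifiers (none of these names come from the actual dataset), assuming pandas is available:

```python
import pandas as pd

# Hypothetical mini-dataset: one row per review, one 0/1 column per qualifier
qualifiers = ["dragons", "romance", "mystery"]
book_tags = {"Book A": {"dragons", "mystery"}, "Book B": {"romance"}}
reviewer_tags = {"alice": {"dragons"}, "bob": {"romance", "mystery"}}

# (reviewer, book, star rating) triples
reviews = [("alice", "Book A", 5), ("bob", "Book B", 4), ("bob", "Book A", 2)]

rows = []
for reviewer, book, score in reviews:
    # A qualifier "matches" when it tags both the book and the reviewer
    matches = book_tags[book] & reviewer_tags[reviewer]
    row = {"Score": score}
    row.update({q: int(q in matches) for q in qualifiers})
    rows.append(row)

df = pd.DataFrame(rows, columns=["Score"] + qualifiers)
print(df)
```

Each row then holds the star rating plus a 0/1 match indicator per qualifier, which is the shape of the data the function below expects.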

Here's the code I have right now (https://gist.github.com/souldeux/99f71087c712c48e50b7):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

def determine_feature_importance(df):
    #Determines the importance of individual features within a dataframe
    #Grab headers for all feature columns, excluding Score & ID fields
    features_list = df.columns.values[4:]
    print("Features List:\n", features_list)

    #set X equal to all feature values, excluding Score & ID fields
    X = df.values[:, 4:]

    #set y equal to all Score values
    y = df.values[:, 0]

    #fit a random forest with near-default parameters to determine feature importance
    print('\nCreating Random Forest Classifier...\n')
    forest = RandomForestClassifier(oob_score=True, n_estimators=10000)
    print('\nFitting Random Forest Classifier...\n')
    forest.fit(X, y)
    feature_importance = forest.feature_importances_
    print(feature_importance)

    #Make importances relative to the maximum importance
    print("\nMaximum feature importance is currently:", feature_importance.max())
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    print("\nNormalized feature importance:\n", feature_importance)
    print("\nNormalized maximum feature importance:\n", feature_importance.max())
    print("\nTo do: set fi_threshold == max?")
    print("\nTesting: setting fi_threshold == 1")
    fi_threshold = 1

    #get indices of all features above fi_threshold
    important_idx = np.where(feature_importance > fi_threshold)[0]
    print("\nRetrieved important_idx:", important_idx)

    #create a list of all feature names above fi_threshold
    important_features = features_list[important_idx]
    print("\n", important_features.shape[0], "important features (>", fi_threshold,
          "% of max importance):\n", important_features)

    #get indices that sort the important features by descending importance
    sorted_idx = np.argsort(feature_importance[important_idx])[::-1]
    print("\nFeatures sorted by importance (DESC):\n", important_features[sorted_idx])

    #generate a horizontal bar plot of the important features
    pos = np.arange(sorted_idx.shape[0]) + 0.5
    plt.subplot(1, 2, 2)
    plt.barh(pos, feature_importance[important_idx][sorted_idx[::-1]], align='center')
    plt.yticks(pos, important_features[sorted_idx[::-1]])
    plt.xlabel('Relative importance')
    plt.ylabel('Variable importance')
    plt.draw()
    plt.show()

    #restrict X to the important features, ordered by importance
    X = X[:, important_idx][:, sorted_idx]

    return "Feature importance determined"

I am successfully generating a plot, but I am honestly not sure what the plot means. As I understand it, it shows how strongly each feature impacts the score variable. But, and I realize this must be a stupid question, how do I know whether that impact is positive or negative?

asked Nov 20 '15 by souldeux

People also ask

How do you interpret a feature importance score?

Feature Importance refers to techniques that calculate a score for all the input features for a given model — the scores simply represent the “importance” of each feature. A higher score means that the specific feature will have a larger effect on the model that is being used to predict a certain variable.

How do you interpret feature importance in a decision tree?

Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated by the number of samples that reach the node, divided by the total number of samples. The higher the value the more important the feature.
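As a quick sanity check of that definition, here is a minimal sketch with synthetic data (the data and feature layout are invented for illustration). Because the label below is an exact copy of the first feature, the tree splits once on that feature, so its impurity-based importance takes the entire weight; the importances always sum to 1:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 3))   # three binary features
y = X[:, 0]                            # the label copies feature 0 exactly

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.feature_importances_)       # impurity-based importances, summing to 1
```

With real data the weight is spread across every feature that is used in a split, in proportion to the impurity decrease each split achieves.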

What is a good number of estimators in RandomForestClassifier?

It's good practice to stay within 200 to 300 estimators in a random forest if your data has more than 100,000 (1 lakh) rows, or to use grid search to find the right number of estimators.
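The grid-search approach can be sketched as follows, on a small synthetic dataset (the data and the candidate values of n_estimators are placeholders, not recommendations):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # synthetic binary target

# Cross-validated search over candidate forest sizes
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

search.best_params_ then holds the forest size that scored best under cross-validation.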


1 Answer

In short: you do not. Decision trees (the building blocks of a random forest) do not work that way. With linear models there is a simple distinction between a "positive" and a "negative" feature, because the only effect a feature can have on the final result is to be added (with a weight). Nothing more. An ensemble of decision trees, however, can encode arbitrarily complex rules for each feature, for example "if the book has a red cover and more than 100 pages, then if it contains dragons it gets a high score" but "if the book has a blue cover and more than 100 pages, then if it contains dragons it gets a low score", and so on.

Feature importance only gives you a notion of which features contribute to the decision, not "which way", because sometimes a feature will push one way and sometimes the other.
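A compact way to see this is an XOR-style target, where flipping either feature flips the label. This synthetic sketch (invented data, not the asker's dataset) shows both features receiving high importance even though neither has any consistent "direction": each is almost uncorrelated with the label on its own:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(500, 2))   # two binary features
y = X[:, 0] ^ X[:, 1]                  # XOR: flipping either feature flips the label

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)         # both features carry high importance
print(np.corrcoef(X[:, 0], y)[0, 1])       # yet near-zero linear correlation
```

So a large importance score is entirely compatible with a feature whose effect reverses depending on the other features.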

What can you do? You can make an extreme simplification: assume you are interested in each feature in complete isolation from the others. Then, once you know which features are important, you can count, for each class (each score, in your case), how often the feature appears. This gives you the distribution

P(gets score X | has feature Y)

which will show you, more or less (after marginalization), whether the feature has a positive or negative impact.
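Computing that conditional distribution empirically is a one-liner per feature. A minimal sketch with invented review data (the "dragons" column is hypothetical), assuming the 0/1-match layout from the question:

```python
import pandas as pd

# Hypothetical reviews: star rating plus one binary "match" column per qualifier
df = pd.DataFrame({
    "Score":   [5, 5, 4, 2, 1, 5, 3, 1],
    "dragons": [1, 1, 1, 0, 0, 1, 0, 0],
})

# Empirical P(Score = x | dragons match present)
dist = df.loc[df["dragons"] == 1, "Score"].value_counts(normalize=True).sort_index()
print(dist)
```

If the distribution is concentrated on high scores (as it is here), the feature leans positive; repeating this for each important feature gives a rough sign for each one.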

answered Oct 25 '22 by lejlot