I am a beginner when it comes to machine learning, and I'm having trouble interpreting some of the results I'm getting from my first program. Here's the setup:
I have a dataset of book reviews. These books can be tagged with any number of qualifiers from a set of about 1600. The people reviewing these books can also tag themselves with these qualifiers to indicate that they like to read things with that tag.
The dataset has a column for each qualifier. For every review, if a given qualifier is used to tag both the book and the reviewer, a value of 1 is recorded. If there is no "match" for a given qualifier on a given review, a value of 0 is recorded.
There is also a "Score" column, which holds an integer 1-5 for each review (the "star rating" of that review). My goal is to determine what features are most important to getting a high score.
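For context, a row of the dataframe looks something like this (the column names here are invented for illustration; the relevant point is that Score is in column 0, a few ID fields follow, and the qualifier columns start at index 4, which is what the slicing in the code below assumes):

import pandas as pd

#toy illustration of the layout: Score first, ID fields, then one 0/1 column per qualifier
df = pd.DataFrame(
    [[5, 101, 11, 21, 1, 0, 0],
     [2, 102, 12, 22, 0, 1, 0],
     [4, 103, 13, 23, 1, 0, 1]],
    columns=['Score', 'ReviewID', 'BookID', 'ReviewerID', 'dragons', 'romance', 'nonfiction'],
)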
Here's the code I have right now (https://gist.github.com/souldeux/99f71087c712c48e50b7):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

def determine_feature_importance(df):
    #Determines the importance of individual features within a dataframe
    #Grab headers for all feature columns, excluding the Score & ID fields
    features_list = df.columns.values[4::]
    print "Features List: \n", features_list
    #set X equal to all feature values, excluding the Score & ID fields
    X = df.values[:, 4::]
    #set y equal to all Score values
    y = df.values[:, 0]
    #fit a random forest with near-default parameters to determine feature importance
    print '\nCreating Random Forest Classifier...\n'
    forest = RandomForestClassifier(oob_score=True, n_estimators=10000)
    print '\nFitting Random Forest Classifier...\n'
    forest.fit(X, y)
    feature_importance = forest.feature_importances_
    print feature_importance
    #Make importances relative to the maximum importance
    print "\nMaximum feature importance is currently: ", feature_importance.max()
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    print "\nNormalized feature importance: \n", feature_importance
    print "\nNormalized maximum feature importance: \n", feature_importance.max()
    print "\nTo do: set fi_threshold == max?"
    print "\nTesting: setting fi_threshold == 1"
    fi_threshold = 1
    #get indices of all features over fi_threshold
    important_idx = np.where(feature_importance > fi_threshold)[0]
    print "\nRetrieved important_idx: ", important_idx
    #create a list of all feature names above fi_threshold
    important_features = features_list[important_idx]
    print "\n", important_features.shape[0], "important features (>", fi_threshold, "% of max importance):\n", important_features
    #get sorted indices of important features
    sorted_idx = np.argsort(feature_importance[important_idx])[::-1]
    print "\nFeatures sorted by importance (DESC):\n", important_features[sorted_idx]
    #generate plot
    pos = np.arange(sorted_idx.shape[0]) + .5
    plt.subplot(1, 2, 2)
    plt.barh(pos, feature_importance[important_idx][sorted_idx[::-1]], align='center')
    plt.yticks(pos, important_features[sorted_idx[::-1]])
    plt.xlabel('Relative importance')
    plt.ylabel('Variable importance')
    plt.draw()
    plt.show()
    X = X[:, important_idx][:, sorted_idx]
    return "Feature importance determined"
I am successfully generating a plot, but I am honestly not sure what the plot means. As I understand it, this is showing me how strongly any given feature impacts the score variable. But, and I realize this must be a stupid question, how do I know if the impact is positive or negative?
Feature Importance refers to techniques that calculate a score for all the input features for a given model — the scores simply represent the “importance” of each feature. A higher score means that the specific feature will have a larger effect on the model that is being used to predict a certain variable.
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
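To make that concrete, here is a minimal sketch that recomputes this quantity for a single fitted tree from its tree_ arrays; it mirrors what scikit-learn does internally for feature_importances_, and averaging it over forest.estimators_ should roughly reproduce the forest's importances (attribute names assume a reasonably recent scikit-learn):

import numpy as np

def impurity_importance(fitted_tree, n_features):
    #Recompute mean decrease in impurity for one fitted decision tree
    t = fitted_tree.tree_
    importance = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:
            #leaf node, no split, contributes nothing
            continue
        #weighted impurity decrease produced by this node's split,
        #credited to the feature the node splits on
        importance[t.feature[node]] += (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right]
        )
    importance /= t.weighted_n_node_samples[0]   #divide by the total (weighted) sample count
    return importance / importance.sum()          #normalize so the importances sum to 1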
It's good practice to keep the number of estimators in a random forest to around 200 to 300 if your dataset has more than 100,000 rows, or to use a grid search to find the right number of estimators.
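If you want to let the data pick the forest size, a rough sketch of that grid search might look like this (GridSearchCV lives in sklearn.model_selection in current scikit-learn; X and y are the same arrays built in the code above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

#try a few forest sizes and keep the one with the best cross-validated accuracy
param_grid = {'n_estimators': [100, 200, 300, 500]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)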
In short, you do not. Decision trees (the building block of a random forest) do not work this way. With linear models there is a simple distinction between a "positive" and a "negative" feature, because the only impact a feature can have on the final result is to be added in (with a weight). Nothing more. An ensemble of decision trees, however, can encode arbitrarily complex rules for each feature, for example "if the book has a red cover and more than 100 pages, then if it contains dragons it gets a high score" but "if the book has a blue cover and more than 100 pages, then if it contains dragons it gets a low score", and so on.
Feature importance only gives you a notion of which features contribute to the decision, not "which way", because sometimes a feature will push the prediction one way and sometimes the other.
What can you do? You can apply an extreme simplification: assume you are only interested in each feature in complete isolation from the others. Then, once you know which features are important, you can count how often each one appears within each class (the scores, in your case). This way you get the distribution
P(gets score X|has feature Y)
which will show you, more or less, whether (after marginalization) the feature has a positive or negative impact.
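A rough sketch of that with pandas, assuming df is the dataframe from the question and 'dragons' is one of the qualifier columns flagged as important (the column name is made up for illustration):

#distribution of scores when the qualifier matched vs. when it did not
matched = df[df['dragons'] == 1]['Score'].value_counts(normalize=True).sort_index()
unmatched = df[df['dragons'] == 0]['Score'].value_counts(normalize=True).sort_index()
print(matched)
print(unmatched)

If the matched distribution is shifted toward higher scores than the unmatched one, the feature is, loosely speaking, acting "positively"; if it is shifted toward lower scores, "negatively".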