AUC-based Feature Importance using Random Forest

I'm trying to predict a binary variable with both random forests and logistic regression. I've got heavily unbalanced classes (approx 1.5% of Y=1).

The default feature importance techniques in random forests are based on classification accuracy (error rate) - which has been shown to be a bad measure for unbalanced classes (see here and here).
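To make the accuracy problem concrete, here is a minimal sketch with made-up labels (1,000 samples, 15 positives, mirroring the 1.5% rate above). The `accuracy` and `auc` helpers are hypothetical, hand-rolled functions written for this illustration, not library calls.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up data: 15 positives out of 1,000 samples (1.5%)
y_true = [1] * 15 + [0] * 985

# A useless "model" that always predicts the majority class
always_zero = [0] * 1000

acc_trivial = accuracy(y_true, always_zero)
auc_trivial = auc(y_true, always_zero)
print(acc_trivial, auc_trivial)
```

The trivial classifier looks excellent on accuracy while its AUC is at chance level, which is why an accuracy-based importance can be misleading here.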

The two standard VIMs for feature selection with RF are the Gini VIM and the permutation VIM. Roughly speaking the Gini VIM of a predictor of interest is the sum over the forest of the decreases of Gini impurity generated by this predictor whenever it was selected for splitting, scaled by the number of trees.
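For reference, scikit-learn exposes an impurity-based importance of this kind through the `feature_importances_` attribute of a fitted forest (normalized so the values sum to 1). A minimal sketch on synthetic data; the dataset and every parameter value below are assumptions chosen only for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced data (about 95% of samples in class 0); illustration only
X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                           n_redundant=0, weights=[0.95], random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based (Gini) importance, one value per feature
gini_vim = rf.feature_importances_
for idx in np.argsort(gini_vim)[::-1]:
    print(f"feature {idx}: {gini_vim[idx]:.4f}")
```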

My question is: is an AUC-based variable importance measure implemented in scikit-learn (as it is in the R package party)? Or is there a workaround?

PS: This question is related to another one.

778

asked Jul 08 '15 09:07

gowithefloww

2 Answers

scoring is only a performance evaluation tool applied to the test sample; it does not enter the internal DecisionTreeClassifier algorithm at each split node. For the tree algorithm you can only set the criterion (a kind of internal loss function at each split node), to either gini or entropy.

scoring can be used in a cross-validation context where the goal is to tune some hyperparameters (like max_depth). In your case, you can use a GridSearchCV to tune some of your hyperparameters using the scoring function roc_auc.
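A minimal sketch of that tuning step, assuming synthetic data and an arbitrary small grid (neither comes from the question):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced data (about 95% of samples in class 0); illustration only
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95], random_state=0)

# Hypothetical grid; real ranges depend on the data
param_grid = {"max_depth": [3, 5], "min_samples_leaf": [1, 6]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced",
                           random_state=0),
    param_grid,
    scoring="roc_auc",  # candidates are ranked by cross-validated AUC
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Note that scoring only changes how candidate models are compared; each tree is still grown with its split criterion.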

157

answered Oct 25 '22 07:10

Jianxun Li

After doing some research, this is what I came up with:

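import numpy as np
from collections import defaultdict
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# db_train, X_train, Y_train, X_test and Y_test are assumed to exist already;
# X_train / X_test are NumPy arrays whose columns follow the order of `names`.
names = db_train.iloc[:, 1:].columns.tolist()

# -- Gridsearched parameters
model_rf = RandomForestClassifier(n_estimators=500,
                                  class_weight="balanced",  # "auto" in older scikit-learn
                                  criterion='gini',
                                  bootstrap=True,
                                  max_features=10,
                                  min_samples_split=2,  # recent versions reject 1
                                  min_samples_leaf=6,
                                  max_depth=3,
                                  n_jobs=-1)
scores = defaultdict(list)

# -- Fit the model (could be cross-validated)
rf = model_rf.fit(X_train, Y_train)
# AUC is computed from the predicted probability of the positive class,
# not from hard 0/1 labels
acc = roc_auc_score(Y_test, rf.predict_proba(X_test)[:, 1])

# -- Shuffle one column at a time and record the relative drop in AUC
for i in range(X_train.shape[1]):
    X_t = X_test.copy()
    np.random.shuffle(X_t[:, i])  # in-place shuffle of column i only
    shuff_acc = roc_auc_score(Y_test, rf.predict_proba(X_t)[:, 1])
    scores[names[i]].append((acc - shuff_acc) / acc)

print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat) for
              feat, score in scores.items()], reverse=True))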

Features sorted by their score:
[(0.0028999999999999998, 'Var1'), (0.0027000000000000001, 'Var2'), (0.0023999999999999998, 'Var3'), (0.0022000000000000001, 'Var4'), (0.0022000000000000001, 'Var5'), (0.0022000000000000001, 'Var6'), (0.002, 'Var7'), (0.002, 'Var8'), ...]

The output is not very sexy, but you get the idea. The weakness of this approach is that the feature importances seem to depend heavily on the model parameters. I ran it with different parameters (max_depth, max_features, ...) and got very different results. So I decided to run a grid search over the parameters (scoring = 'roc_auc') and then apply this VIM (Variable Importance Measure) to the best model.

I took my inspiration from this (great) notebook.

All suggestions/comments are most welcome !
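For readers on a recent scikit-learn (0.22 or later), the shuffle-and-rescore loop above can also be sketched with `sklearn.inspection.permutation_importance`, which accepts a `scoring` argument. This is an alternative under stated assumptions, not the code used in this answer: the data is synthetic, and the forest below merely stands in for the grid-searched best model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (about 95% of samples in class 0); illustration only
X, y = make_classification(n_samples=3000, n_features=8, n_informative=3,
                           n_redundant=0, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Stand-in for the best model returned by a grid search
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                            min_samples_leaf=6, random_state=0).fit(X_tr, y_tr)

# Mean drop in held-out AUC when each column is shuffled, over 10 repeats
result = permutation_importance(rf, X_te, y_te, scoring="roc_auc",
                                n_repeats=10, random_state=0)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```

Repeating the shuffle several times and looking at the spread gives a rough sense of how stable each importance is.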

answered Oct 25 '22 09:10

gowithefloww

Tags: python, machine-learning, scikit-learn, scoring