
Python - Scikit find variable importance for categorical variables

I'm trying to use scikit-learn in Python for a couple of different classifier problems (RF, GBM, etc.). In addition to building models and making predictions, I'd like to see the variable importance. I know there is a way to get the importances

importances = clf.feature_importances_
print(importances)

but how do I get something more refined that ties each importance to a variable name (i.e. like summary(gbm) or varImp(randomForest) in R), especially when it's a categorical variable with multiple levels?

asked Mar 19 '15 by screechOwl

People also ask

Can Sklearn handle categorical variables?

You can feed categorical variables directly to a random forest using the following approach: first, convert the feature's categories to numbers using scikit-learn's LabelEncoder; second, convert the label-encoded feature's type to string (object).
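For illustration, a minimal sketch of the first step using scikit-learn's LabelEncoder (the DataFrame and column name are made up):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data with one categorical feature
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])  # e.g. blue=0, green=1, red=2
print(le.classes_)  # original labels, indexed by their encoded value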

Can we use StandardScaler on categorical features?

The continuous variables need to be scaled, but a couple of categorical variables may also be of integer type. Applying StandardScaler to the whole frame would scale those integer-coded categorical variables as well, which is not what we want.
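One way around this, as a hedged sketch (the column names here are invented), is to scale only the continuous columns with a ColumnTransformer and pass the categorical ones through untouched:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# 'temp' is continuous; 'season' is an integer-coded category
df = pd.DataFrame({'temp': [9.8, 15.2, 22.1, 30.4],
                   'season': [1, 2, 3, 4]})

preprocess = ColumnTransformer(
    [('scale', StandardScaler(), ['temp'])],  # scale continuous columns only
    remainder='passthrough')                  # leave 'season' untouched
X = preprocess.fit_transform(df)
print(X)  # 'temp' standardized, 'season' codes unchanged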

How do you factor a categorical variable in Python?

In Python, unlike R, there is no built-in factor type, though pandas offers a close analogue. Factors in R are stored as vectors of integer values and can be labelled. If we have our data in a Series or DataFrame, we can convert these categories to numbers using the pandas Series' astype method and specify 'category'.
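A minimal sketch of that conversion (the data is hypothetical):

import pandas as pd

s = pd.Series(['low', 'high', 'medium', 'high'])
cat = s.astype('category')   # pandas' analogue of an R factor
print(cat.cat.categories)    # the labels ('levels' in R terms)
print(cat.cat.codes)         # the underlying integer codes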

How do you do feature selection on categorical variables?

One way is to encode the categorical variables using one-hot encoding (each categorical level becomes its own numerical column of 0s and 1s, where 0 means absent and 1 means present). Many prefer this method because no information is lost and the concept is easy to understand.
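As a sketch, pandas' get_dummies performs this encoding in one call (scikit-learn's OneHotEncoder is the equivalent transformer); the column name here is invented:

import pandas as pd

df = pd.DataFrame({'weather': ['clear', 'rain', 'clear', 'snow']})
dummies = pd.get_dummies(df['weather'], prefix='weather')
print(dummies)  # one indicator column per level of 'weather'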


1 Answer

The variable importance (or feature importance) is calculated for all the features that you fit your model to. This pseudo-code gives you an idea of how variable names and importances can be related:

import pandas as pd

train = pd.read_csv("train.csv")
cols = ['hour', 'season', 'holiday', 'workingday', 'weather', 'temp', 'windspeed']
clf = YourClassifier()  # placeholder - e.g. RandomForestClassifier() or GradientBoostingClassifier()
clf.fit(train[cols], train.targets)  # targets/labels

print(len(clf.feature_importances_))  # one importance score per fitted feature
print(len(cols))

You will see that the lengths of the two printed lists are the same - you can essentially zip the lists together or manipulate them however you wish. If you'd like to show variable importance nicely in a plot, you could use this:

import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(6 * 1.618, 6))
index = np.arange(len(cols))
plt.bar(index, clf.feature_importances_, color='black', alpha=0.5)
plt.xlabel('features')
plt.ylabel('importance')
plt.title('Feature importance')
plt.xticks(index, cols)  # bars are centre-aligned, so put the ticks at the bar positions
plt.tight_layout()
plt.show()

If you don't want to use this method (meaning that you are fitting all columns, not just the selected few set in the cols variable), then you could get the column/feature/variable names of your data with train.columns.values (and then map this list together with the variable importance list, or manipulate it in some other way).
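For instance, a minimal sketch of that mapping, assuming clf has been fitted as above:

ranked = sorted(zip(cols, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f'{name}: {score:.4f}')  # features from most to least important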

answered Oct 26 '22 by kasparg