I'm trying to use scikit-learn in Python to do a couple of different classification problems (RF, GBM, etc.). In addition to building models and making predictions, I'd like to see the variable importance. I know there is a way to get the importances
importances = clf.feature_importances_
print(importances)
but how do I get something more refined that connects each importance to the variable name (i.e. like summary(gbm) in R or varImp(randomForest) in R), especially if it's a categorical variable with multiple levels?
You can feed categorical variables directly to a random forest using the approach below: first, convert the categories of the feature to numbers using scikit-learn's LabelEncoder; second, convert the label-encoded feature's type to string (object).
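A minimal sketch of the label-encoding step (the 'weather' column and its values are just made up for illustration):
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'weather': ['sunny', 'rain', 'sunny', 'cloudy']})
le = LabelEncoder()
df['weather'] = le.fit_transform(df['weather'])  # categories -> integer codes
print(le.classes_)             # original labels, in the order of their codes
print(df['weather'].tolist())  # e.g. [2, 1, 2, 0]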
The continuous variables need to be scaled, but a couple of the categorical variables are also of integer type. Applying StandardScaler across the whole frame would therefore scale the integer-coded categorical variables as well, which is not what we want.
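A rough sketch of scaling only the continuous columns and leaving the integer-coded categorical ones untouched (the column names and values here are hypothetical):
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'temp': [9.8, 14.2, 30.1, 22.5],   # continuous
                   'season': [1, 1, 3, 2]})           # integer-coded category

continuous_cols = ['temp']  # only these get scaled
df[continuous_cols] = StandardScaler().fit_transform(df[continuous_cols])
print(df)  # 'season' is left as-is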
In Python, unlike R, there is no built-in way to represent categorical data as factors. Factors in R are stored as vectors of integer values that can be labelled. If our data is in a pandas Series or DataFrame, the closest equivalent is to convert the column with the Series' astype method and specify 'category'.
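For example, something along these lines (the values are made up):
import pandas as pd

s = pd.Series(['low', 'high', 'medium', 'high']).astype('category')
print(s.cat.categories)      # the labels, similar to levels() in R
print(s.cat.codes.tolist())  # the underlying integer codes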
One way is to encode the categorical variables with one-hot encoding (each categorical level becomes its own numerical 0/1 column, where 0 means absent and 1 means present). Many people prefer this method because the information is still present and the concept is easy to understand.
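A quick sketch using pandas' get_dummies (the 'weather' column is hypothetical). Note that after one-hot encoding each level is its own column, so feature_importances_ will report one value per level rather than one per original variable:
import pandas as pd

df = pd.DataFrame({'weather': ['sunny', 'rain', 'sunny', 'cloudy']})
dummies = pd.get_dummies(df['weather'], prefix='weather')  # one 0/1 column per level
print(dummies)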
The variable importance (or feature importance) is calculated for all the features that you are fitting your model to. This pseudo code gives you an idea of how variable names and importances can be related:
import pandas as pd
train = pd.read_csv("train.csv")
cols = ['hour', 'season', 'holiday', 'workingday', 'weather', 'temp', 'windspeed']
clf = YourClassifiers()  # placeholder, e.g. RandomForestClassifier() or GradientBoostingClassifier()
clf.fit(train[cols], train.targets)  # targets/labels
print(len(clf.feature_importances_))
print(len(cols))
You will see that the lengths of the two lists being printed are the same - you can essentially map the lists together or manipulate them how you wish. If you'd like to show variable importance nicely in a plot, you could use this:
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(6 * 1.618, 6))
index = np.arange(len(cols))
bar_width = 0.35
plt.bar(index, clf.feature_importances_, bar_width, color='black', alpha=0.5)
plt.xlabel('features')
plt.ylabel('importance')
plt.title('Feature importance')
plt.xticks(index, cols)  # put each feature name under its bar
plt.tight_layout()
plt.show()
If you don't want to use this method (meaning that you are fitting all columns, not just the selected few set in the cols variable), then you could get the column/feature/variable names of your data with train.columns.values (and then map this list together with the variable importance list, or manipulate them in some other way).
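For example, assuming clf and cols from the snippet above, a pandas Series makes the mapping and sorting a one-liner (this is just one way to do it):
import pandas as pd

importances = pd.Series(clf.feature_importances_, index=cols)
print(importances.sort_values(ascending=False))  # importance next to each variable name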