I've built a pipeline in Scikit-Learn with two steps: the first constructs features and the second is a RandomForestClassifier.
While I can save that pipeline, look at the various steps, and inspect the parameters set in each step, I'd like to be able to examine the feature importances of the resulting model.
Is that possible?
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated by the number of samples that reach the node, divided by the total number of samples. The higher the value the more important the feature.
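As a rough illustration, here is a minimal sketch of that weighted impurity decrease for a single fitted decision tree. This is not scikit-learn's exact implementation, and the toy dataset is just for demonstration, but after normalization it should closely match the tree's feature_importances_:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

t = clf.tree_
importances = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, no impurity decrease
        continue
    # probability of reaching this node = weighted samples here / total samples
    p_node = t.weighted_n_node_samples[node] / t.weighted_n_node_samples[0]
    p_left = t.weighted_n_node_samples[left] / t.weighted_n_node_samples[node]
    p_right = t.weighted_n_node_samples[right] / t.weighted_n_node_samples[node]
    # impurity decrease of the split, weighted by the probability of reaching the node
    decrease = t.impurity[node] - p_left * t.impurity[left] - p_right * t.impurity[right]
    importances[t.feature[node]] += p_node * decrease

importances /= importances.sum()  # normalize, as feature_importances_ does
print(importances)
print(clf.feature_importances_)   # should be (nearly) identical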
Probably the easiest way to gauge feature importances is to look at the model's coefficients. For example, both linear and logistic regression boil down to an equation in which a coefficient (importance) is assigned to each input value.
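For example, a minimal sketch with logistic regression on a toy dataset (the dataset and settings are chosen purely for illustration):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000).fit(X, y)
print(clf.coef_)  # one coefficient ("importance") per input feature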
Ah, yes it is.
You first identify the step where you want to check the estimator.
For instance:
pipeline.steps[1]
Which returns:
('predictor',
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=2,
oob_score=False, random_state=None, verbose=0,
warm_start=False))
You can then access the model step directly:
pipeline.steps[1][1].feature_importances_
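If you want to line those importances up with names, something like the following works; feature_names here is a hypothetical list of whatever column names your feature-construction step produces:
importances = pipeline.steps[1][1].feature_importances_
for feat, score in sorted(zip(feature_names, importances), key=lambda pair: -pair[1]):
    print(feat, score)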
I wrote an article on doing this in general, which you can find here.
In general, for a pipeline you can access the named_steps attribute. This will give you each transformer in the pipeline. So, for example, for this pipeline:
model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("transformer", TfidfTransformer()),
    ("classifier", classifier),
])
we could access the individual feature steps by doing model.named_steps["vectorizer"].get_feature_names()
This will return the list of feature names learned by the CountVectorizer (note that the TfidfTransformer itself does not expose get_feature_names, which is why we read the names from the vectorizer step).
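A minimal usage sketch, assuming classifier above is any scikit-learn estimator (say LogisticRegression()) and using a made-up two-document corpus; the pipeline must be fitted before the vectorizer has a vocabulary to report (in recent scikit-learn versions the method is called get_feature_names_out):
docs = ["the best film ever", "the worst film ever"]  # made-up corpus
model.fit(docs, [1, 0])
print(model.named_steps["vectorizer"].get_feature_names())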
This is all fine and good, but it doesn't really cover many use cases, since we normally want to combine a few features. Take this model for example:
model = Pipeline([
    ("union", FeatureUnion(transformer_list=[
        ("h1", TfidfVectorizer(vocabulary={"worst": 0})),
        ("h2", TfidfVectorizer(vocabulary={"best": 0})),
        ("h3", TfidfVectorizer(vocabulary={"awful": 0})),
        ("tfidf_cls", Pipeline([
            ("vectorizer", CountVectorizer()),
            ("transformer", TfidfTransformer()),
        ])),
    ])),
    ("classifier", classifier),
])
Here we combine a few features using a FeatureUnion and a sub-pipeline. To access these features we'd need to explicitly call each named step in order. For example, to get the TF-IDF features from the internal pipeline, we'd have to do:
model.named_steps["union"].tranformer_list[3][1].named_steps["transformer"].get_feature_names()
That's kind of a headache, but it is doable. Usually I use a variation of the following snippet to get it. The code below just treats nested pipelines/feature unions as a tree and performs a DFS, combining the feature names as it goes.
from typing import List

from sklearn.pipeline import FeatureUnion, Pipeline


def get_feature_names(model, names: List[str], name: str) -> List[str]:
    """This method extracts the feature names in order from a Sklearn Pipeline.

    This method only works with composed Pipelines and FeatureUnions. It will
    pull out all names using DFS from a model.

    Args:
        model: The model we are interested in.
        names: The list of names of final featurization steps.
        name: The current name of the step we want to evaluate.

    Returns:
        feature_names: The list of feature names extracted from the pipeline.
    """
    # Check if the name is one of our feature steps. This is the base case.
    if name in names:
        # If it has the named_steps attribute it's a pipeline and we need to access the features
        if hasattr(model, "named_steps"):
            return extract_feature_names(model.named_steps[name], name)
        # Otherwise get the feature directly
        else:
            return extract_feature_names(model, name)
    elif type(model) is Pipeline:
        feature_names = []
        for name in model.named_steps.keys():
            feature_names += get_feature_names(model.named_steps[name], names, name)
        return feature_names
    elif type(model) is FeatureUnion:
        feature_names = []
        for name, new_model in model.transformer_list:
            feature_names += get_feature_names(new_model, names, name)
        return feature_names
    # If it is none of the above, do not add it.
    else:
        return []
You'll also need this method, which operates on individual transformations (things like the TfidfVectorizer) to get the names. In scikit-learn there isn't a universal get_feature_names, so you have to kind of fudge it for each different case. This is my attempt at doing something reasonable for most use cases.
def extract_feature_names(model, name) -> List[str]:
    """Extracts the feature names from arbitrary sklearn models.

    Args:
        model: The Sklearn model, transformer, clustering algorithm, etc. which we want
            to get named features for.
        name: The name of the current step in the pipeline we are at.

    Returns:
        The list of feature names. If the model does not have named features it constructs
        feature names by appending an index to the provided name.
    """
    if hasattr(model, "get_feature_names"):
        return model.get_feature_names()
    elif hasattr(model, "n_clusters"):
        return [f"{name}_{x}" for x in range(model.n_clusters)]
    elif hasattr(model, "n_components"):
        return [f"{name}_{x}" for x in range(model.n_components)]
    elif hasattr(model, "components_"):
        n_components = model.components_.shape[0]
        return [f"{name}_{x}" for x in range(n_components)]
    elif hasattr(model, "classes_"):
        return list(model.classes_)
    else:
        return [name]
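Putting it all together, here is a hedged usage sketch for the FeatureUnion model above. The corpus and labels are made up, classifier is assumed to be any scikit-learn estimator, and "vectorizer" (not "transformer") is listed as a final featurization step because the CountVectorizer is the step that actually knows the vocabulary:
docs = ["the best film", "the worst film", "an awful film"]  # made-up corpus
model.fit(docs, [1, 0, 0])

# Walk the pipeline tree, treating h1/h2/h3 and the inner CountVectorizer
# ("vectorizer") as the final featurization steps.
print(get_feature_names(model, names=["h1", "h2", "h3", "vectorizer"], name=""))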