I'm working with scikit-learn on a text classification experiment, and I would like to get the names of the best-performing selected features. I tried some of the answers to similar questions, but nothing works. The last lines of the code below are an example of what I tried. For example, when I print feature_names, I get this error: sklearn.exceptions.NotFittedError: This SelectKBest instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
Any solutions?
scaler = StandardScaler(with_mean=False)
enc = LabelEncoder()
y = enc.fit_transform(labels)
feat_sel = SelectKBest(mutual_info_classif, k=200)
clf = linear_model.LogisticRegression()
pipe = Pipeline([('vectorizer', DictVectorizer()),
('scaler', StandardScaler(with_mean=False)),
('mutual_info', feat_sel),
('logistregress', clf)])
feature_names = pipe.named_steps['mutual_info']
X.columns[features.transform(np.arange(len(X.columns)))]
make_pipeline is a utility function that serves as shorthand for constructing pipelines. It takes a variable number of estimators and returns a pipeline, filling in the step names automatically.
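To illustrate the automatic naming, here is a minimal sketch. The two estimators are only picked as examples; the point is that each step name is derived from the lowercased class name.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline derives each step name from the estimator's lowercased class name
auto_pipe = make_pipeline(StandardScaler(with_mean=False), LogisticRegression())
step_names = [name for name, _ in auto_pipe.steps]
print(step_names)  # ['standardscaler', 'logisticregression']
```

With these names, a step would be looked up as auto_pipe.named_steps['standardscaler'] instead of a name chosen by hand.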
NotFittedError: exception class to raise if an estimator is used before fitting. This class inherits from both ValueError and AttributeError to help with exception handling and backward compatibility.
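A short sketch of how this exception shows up in practice. It uses f_classif and k=1 purely as placeholder settings, and calls get_support() on a selector that has never been fitted.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_selection import SelectKBest, f_classif

unfitted_selector = SelectKBest(f_classif, k=1)

caught = None
try:
    # Asking an unfitted selector which features it kept raises NotFittedError
    unfitted_selector.get_support()
except NotFittedError as exc:
    caught = exc

# Because of the double inheritance, "except ValueError" would catch it too
is_value_error = isinstance(caught, ValueError)
is_attribute_error = isinstance(caught, AttributeError)
print(type(caught).__name__, is_value_error, is_attribute_error)
```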
You first have to fit the pipeline and then call feature_names.
Solution
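import numpy as np
from sklearn import linear_model
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# X and labels are the asker's data (not shown in the question)
scaler = StandardScaler(with_mean=False)
enc = LabelEncoder()
y = enc.fit_transform(labels)
feat_sel = SelectKBest(mutual_info_classif, k=200)
clf = linear_model.LogisticRegression()
pipe = Pipeline([('vectorizer', DictVectorizer()),
                 ('scaler', StandardScaler(with_mean=False)),
                 ('mutual_info', feat_sel),
                 ('logistregress', clf)])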
# Now fit the pipeline using your data
pipe.fit(X, y)
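# Now the fitted selector is available through pipe.named_steps
feature_names = pipe.named_steps['mutual_info']
# transform expects a 2D array, so pass the column indices as a single row;
# this assumes X has a .columns attribute matching the selector's input columns
X.columns[feature_names.transform(np.arange(len(X.columns)).reshape(1, -1))[0]]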
General information
In the documentation example you can see the following call:
anova_svm.set_params(anova__k=10, svc__C=.1).fit(X, y)
This sets some initial parameters (the k parameter for anova and the C parameter for svc) and then calls fit(X, y) to fit the pipeline.
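For context, here is a self-contained sketch built around that call. The synthetic data from make_classification and the initial k=5 are assumptions made for illustration; the step names anova and svc are what make the anova__k and svc__C parameter names work.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic data with 20 features, standing in for a real dataset
X, y = make_classification(n_informative=5, n_redundant=0, random_state=42)

anova_svm = Pipeline([
    ("anova", SelectKBest(f_classif, k=5)),
    ("svc", SVC(kernel="linear")),
])

# <step name>__<parameter> addresses a parameter of a single step;
# set_params returns the pipeline itself, so fit can be chained
anova_svm.set_params(anova__k=10, svc__C=0.1).fit(X, y)

# After fitting, the selector reports which of the 20 columns it kept
n_selected = int(anova_svm.named_steps["anova"].get_support().sum())
print(n_selected)  # 10
```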
EDIT:
For the new error: since your X is a list of dictionaries, I see one way to get the columns attribute that you want. This can be done using pandas.
from pandas import DataFrame

X = [{'age': 10, 'name': 'Tom'}, {'age': 5, 'name': 'Mark'}]
df = DataFrame(X)
len(df.columns)
result:
2
Hope this helps