
feature names from sklearn pipeline: not fitted error

I'm working with scikit-learn on a text classification experiment, and I would like to get the names of the best-performing selected features. I tried some of the answers to similar questions, but nothing works. The last lines of the code below are an example of what I tried. When I print feature_names, I get this error: sklearn.exceptions.NotFittedError: This SelectKBest instance is not fitted yet. Call 'fit' with appropriate arguments before using this method. Any solutions?

scaler = StandardScaler(with_mean=False) 

enc = LabelEncoder()
y = enc.fit_transform(labels)

feat_sel = SelectKBest(mutual_info_classif, k=200)  
clf = linear_model.LogisticRegression()

pipe = Pipeline([('vectorizer', DictVectorizer()),
                 ('scaler', StandardScaler(with_mean=False)),
                 ('mutual_info', feat_sel),
                 ('logistregress', clf)])

feature_names = pipe.named_steps['mutual_info']
X.columns[features.transform(np.arange(len(X.columns)))]
Bambi asked Jul 23 '17 16:07





1 Answer

You first have to fit the pipeline; only then can you ask the fitted steps for the feature names:

Solution

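import numpy as np
from sklearn import linear_model
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# X (a list of feature dicts) and labels come from your own data
enc = LabelEncoder()
y = enc.fit_transform(labels)

feat_sel = SelectKBest(mutual_info_classif, k=200)
clf = linear_model.LogisticRegression()

pipe = Pipeline([('vectorizer', DictVectorizer()),
                 ('scaler', StandardScaler(with_mean=False)),
                 ('mutual_info', feat_sel),
                 ('logistregress', clf)])

# Fit the pipeline first; this fits every step, including SelectKBest
pipe.fit(X, y)

# Only now can the fitted steps be queried through pipe.named_steps
selector = pipe.named_steps['mutual_info']
vectorizer = pipe.named_steps['vectorizer']
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
all_names = np.asarray(vectorizer.get_feature_names_out())
feature_names = all_names[selector.get_support()]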
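To try the fit-then-query order on something small, here is a minimal self-contained sketch. The toy data, the step names and k=2 are made up for illustration, and f_classif is swapped in for mutual_info_classif only to keep the tiny example simple; the scaler is left out for the same reason. It assumes a scikit-learn version where DictVectorizer has either get_feature_names_out() or the older get_feature_names().

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: a list of feature dicts, as in the question
X = [
    {'word_count': 2, 'has_link': 0, 'noise': 5},
    {'word_count': 3, 'has_link': 0, 'noise': 1},
    {'word_count': 2, 'has_link': 0, 'noise': 4},
    {'word_count': 4, 'has_link': 1, 'noise': 2},
    {'word_count': 9, 'has_link': 1, 'noise': 5},
    {'word_count': 8, 'has_link': 1, 'noise': 1},
    {'word_count': 9, 'has_link': 1, 'noise': 4},
    {'word_count': 7, 'has_link': 0, 'noise': 2},
]
labels = ['ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'spam']
y = LabelEncoder().fit_transform(labels)

pipe = Pipeline([('vectorizer', DictVectorizer()),
                 ('select', SelectKBest(f_classif, k=2)),
                 ('logistregress', LogisticRegression())])

# Querying the selector before fitting reproduces the error from the question
try:
    pipe.named_steps['select'].get_support()
    raised_before_fit = False
except NotFittedError:
    raised_before_fit = True

# Fitting the pipeline fits every step in order
pipe.fit(X, y)

vec = pipe.named_steps['vectorizer']
# Fall back to the pre-1.0 method name if the newer one is missing
if hasattr(vec, 'get_feature_names_out'):
    all_names = vec.get_feature_names_out()
else:
    all_names = vec.get_feature_names()

# Boolean mask over the vectorizer's output columns, True for kept features
mask = pipe.named_steps['select'].get_support()
selected = [str(name) for name in np.asarray(all_names)[mask]]
print(selected)
```

The names come from the vectorizer step rather than from X, because the selector only sees the columns that DictVectorizer produced.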

General information

The Pipeline example in the scikit-learn documentation follows the same pattern:

anova_svm.set_params(anova__k=10, svc__C=.1).fit(X, y)

This sets some initial parameters (k for the anova step and C for the svc step) and then calls fit(X, y) to fit the pipeline.
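The step__parameter naming used by set_params can be sketched as follows. This is only an illustration under assumptions: the synthetic data from make_classification, the initial k=5 and the linear SVC are my own choices, not necessarily what the documentation example uses.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic data standing in for a real dataset (an assumption for this sketch)
X, y = make_classification(n_samples=100, n_features=20,
                           n_informative=5, random_state=0)

anova_svm = Pipeline([('anova', SelectKBest(f_classif, k=5)),
                      ('svc', SVC(kernel='linear'))])

# '<step name>__<parameter>' reaches into a step; set_params returns the
# pipeline itself, so fit can be chained directly
anova_svm.set_params(anova__k=10, svc__C=0.1).fit(X, y)

k_after = anova_svm.named_steps['anova'].k
c_after = anova_svm.named_steps['svc'].C
# After fitting, the selector can report how many features it kept
n_selected = int(anova_svm.named_steps['anova'].get_support().sum())
print(k_after, c_after, n_selected)
```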

EDIT:

For the new error: since your X is a list of dictionaries, it has no columns attribute. One way to get the columns you want is to build a pandas DataFrame from it:

from pandas import DataFrame

X = [{'age': 10, 'name': 'Tom'}, {'age': 5, 'name': 'Mark'}]

df = DataFrame(X)
len(df.columns)

result:

2
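One caveat, shown here as a small sketch rather than as part of the original answer: the DataFrame columns are not necessarily the features the pipeline works with, because DictVectorizer one-hot encodes string values. The snippet reuses the two example dicts and assumes a scikit-learn version with either get_feature_names_out() or the older get_feature_names().

```python
from sklearn.feature_extraction import DictVectorizer

X = [{'age': 10, 'name': 'Tom'}, {'age': 5, 'name': 'Mark'}]

vec = DictVectorizer()
vec.fit(X)

# Fall back to the pre-1.0 method name if the newer one is missing
if hasattr(vec, 'get_feature_names_out'):
    names = [str(n) for n in vec.get_feature_names_out()]
else:
    names = [str(n) for n in vec.get_feature_names()]

# String values become one 'key=value' feature each; numeric values keep their key
print(names)       # ['age', 'name=Mark', 'name=Tom']
print(len(names))  # 3
```

So for data like this, the names to index with the selector's mask are the vectorizer's three feature names, not the two DataFrame columns.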

Hope this helps

seralouk answered Oct 04 '22 22:10
