
ValueError: could not convert string to float: ' '. Is Permutation importance only applicable for numeric features?

I have a DataFrame whose columns have categorical, float, and int dtypes.
X contains features of all three dtypes, and y is int.
I've created a pipeline as given below.

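from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

def get_imputer():
    ...  # imputing function (details omitted)

def get_encoder():
    ...  # some encoder function (details omitted)

# model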

pipeline = Pipeline(steps=[
        ('imputer', get_imputer()),
        ('encoder', get_encoder()),
        ('regressor', RandomForestRegressor())
    ])

I need to find the permutation importance of the model. Below is the code for that.

import eli5
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(pipeline.steps[2][1], random_state=1).fit(X, y)
eli5.show_weights(perm)

But this code is throwing an error as follows:

ValueError: could not convert string to float: ''
Nayana Madhu asked May 10 '26 22:05


2 Answers

Let's briefly go over how PermutationImportance works.

After you have trained your model on all the features, PermutationImportance shuffles the values of one column at a time and checks the effect on the loss function.

For example, suppose there are 5 features (columns) and n rows:

f1   f2   f3   f4   f5
v1   v2   v3   v4   v5
v6   v7   v8   v9   v10
...  vt   ...

To determine whether column f3 is important, it shuffles the values in column f3: for example, the value of f3 in row x is swapped with the value of f3 in row y. It then checks the effect on the loss function. A large degradation means the model relies on that feature; little or no change means the feature matters little.
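The shuffling idea described above can be sketched in a few lines of plain Python. This is only an illustration of the technique, not eli5's actual implementation: `toy_model`, `mse`, and `permutation_importance_sketch` are hypothetical names, the "trained model" is a fixed function, and mean squared error stands in for the loss.

```python
import random

def mse(model, rows, targets):
    """Mean squared error of the model over all rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance_sketch(model, rows, targets, n_repeats=10, seed=0):
    """For each column, return the mean increase in MSE after shuffling it."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only this column; every other column stays in place
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            increases.append(mse(model, permuted, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical "trained model": relies heavily on column 0,
# slightly on column 1, and ignores column 2.
def toy_model(row):
    return 3.0 * row[0] + 0.5 * row[1]

data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(3)] for _ in range(200)]
targets = [toy_model(r) for r in rows]

importances = permutation_importance_sketch(toy_model, rows, targets)
print(importances)
```

Shuffling the column the model ignores leaves the loss unchanged, so its importance comes out as zero, while the heavily used column gets the largest value.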

Now, to answer this particular question: scikit-learn models such as RandomForestRegressor are trained on numerical features only (an ML model does not understand text directly). So the data you pass to PermutationImportance must be numeric as well. Since you trained the model after converting the categorical/textual values to numbers, you need to apply the same conversion to the input you give to PermutationImportance.

Hence, PermutationImportance should be used only once your data is pre-processed and everything in your DataFrame is numerical.

YoungSheldon answered May 12 '26 13:05


For the next poor soul...

I came across this post while having the same problem. The accepted answer makes total sense, but the OP's pipeline already appears to handle the categorical data with encoders that convert it to numeric values.

So it appears that PermutationImportance checks that the array is numeric far too early in the process (before the pipeline runs at all). Instead, it should check after the preprocessing steps and right before fitting the model. This is frustrating, because if it doesn't work with pipelines, it is hard to use.

I started off having some luck using sklearn's implementation of permutation_importance instead... But then I figured it out.

You need to separate the pipeline again and you should be able to get it to work. It's annoying, but it works!

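import eli5
from eli5.sklearn import PermutationImportance

# The pipeline must already be fitted, e.g. pipeline.fit(X, y)
estimator = pipeline.named_steps['regressor']

# Every step except the final regressor (Pipeline slicing needs scikit-learn >= 0.21)
preprocessor = pipeline[:-1]

# Apply the fitted imputer and encoder so the regressor sees numeric data
X2 = preprocessor.transform(X)

# Only sparse output has .toarray(); convert to a dense array if needed
if hasattr(X2, 'toarray'):
    X2 = X2.toarray()

perm = PermutationImportance(estimator, random_state=1).fit(X2, y)
eli5.show_weights(perm)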
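
For comparison, here is a minimal sketch of the scikit-learn route mentioned earlier, `sklearn.inspection.permutation_importance`, which can be given the whole fitted pipeline together with the raw, unencoded X. The toy data, the column names `color` and `size`, and the `ColumnTransformer` preprocessing are assumptions made for illustration; they are not the OP's actual imputer or encoder.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical and one numeric feature
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=200),
    "size": rng.normal(size=200),
})
y = (X["color"] == "red") * 5.0 + X["size"] + rng.normal(scale=0.1, size=200)

# Assumed preprocessing: encode the categorical column, impute the numeric one
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("num", SimpleImputer(strategy="median"), ["size"]),
])
pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=1)),
])
pipeline.fit(X, y)

# Columns of the raw X are shuffled; the pipeline re-encodes them on every call
result = permutation_importance(pipeline, X, y, n_repeats=5, random_state=1)
for name, mean in zip(X.columns, result.importances_mean):
    print(f"{name}: {mean:.3f}")
```

Because the shuffling happens on the original columns, this reports one importance per input feature rather than one per one-hot encoded column, which is usually easier to read.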
Josh answered May 12 '26 13:05