
ValueError: feature_names mismatch: in xgboost in the predict() function

I have trained an XGBRegressor model. When I use this trained model to predict on a new input, the predict() function throws a feature_names mismatch error, although the input feature vector has the same structure as the training data.

Also, in order to build the feature vector in the same structure as the training data, I am doing a lot of inefficient processing, such as adding new empty columns (if the data does not exist) and then rearranging the columns so that they match the training structure. Is there a better and cleaner way of formatting the input so that it matches the training structure?

Sujay S Kumar asked Feb 20 '17 07:02


5 Answers

This error occurs when the order of the column names at model-building time differs from the order of the column names at scoring time.

I used the following steps to overcome this error.

First, load the pickled model:

model = pickle.load(open("saved_model_file", "rb"))

Extract the column names in the order in which they were used during training:

cols_when_model_builds = model.get_booster().feature_names

Then reorder the pandas DataFrame to match:


pd_dataframe = pd_dataframe[cols_when_model_builds]
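
If the new data may also be missing some of the training columns, plain indexing like the line above raises a KeyError. A minimal sketch of one way to handle both the ordering and the missing columns in a single step uses pandas' DataFrame.reindex. The helper name align_to_training, the example column names, and the fill value of 0 are assumptions for illustration; whether 0 (rather than NaN) is a sensible filler depends on how the features were encoded.

```python
import pandas as pd


def align_to_training(df, feature_names, fill_value=0):
    """Return a copy of df with exactly the training columns, in training order.

    Columns missing from df are created and filled with fill_value;
    columns the model never saw are dropped.
    """
    return df.reindex(columns=list(feature_names), fill_value=fill_value)


# Hypothetical data: assume the model was trained on these columns, in this order.
feature_names = ["age", "city_NY", "city_SF"]
new_data = pd.DataFrame({"city_SF": [1, 0], "age": [31, 45], "extra": [9, 9]})

aligned = align_to_training(new_data, feature_names)
print(list(aligned.columns))
```

With a fitted model, feature_names could be taken from model.get_booster().feature_names, as in the steps above.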

Athar answered Nov 06 '22 09:11


Try converting the data into an ndarray before passing it to fit/predict. For example, if your training data is train_df and your test data is test_df, use the code below:

train_x = train_df.values
test_x = test_df.values

Now fit the model:

xgb.fit(train_x,train_y)

Finally, predict:

pred = xgb.predict(test_x)
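
One caveat with this approach: a NumPy array carries no column names, so nothing can check them any more, and a test frame whose columns are in a different order would be scored with the wrong features without any error. A small sketch of the problem and a guard against it, using only pandas; the frames and column names are made up for illustration.

```python
import pandas as pd

# Made-up frames with the same columns in a different order.
train_df = pd.DataFrame({"height": [1.0, 2.0], "weight": [10.0, 20.0]})
test_df = pd.DataFrame({"weight": [30.0], "height": [3.0]})

# Naive conversion: the first array column now holds "weight", not "height".
naive_test_x = test_df.values

# Safer: put the test columns in training order before dropping the names.
train_x = train_df.values
test_x = test_df[train_df.columns].values

print(naive_test_x)
print(test_x)
```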

Hope this helps!

saurabh kumar answered Nov 06 '22 09:11


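I also had this problem when I used a pandas DataFrame (non-sparse representation).

I converted the training and testing data into NumPy ndarrays:

X_train = X_train.to_numpy()
X_test = X_test.to_numpy()

(I originally used as_matrix(), which was removed in pandas 1.0; to_numpy() or .values does the same job on current versions.)

This is how I got rid of the error!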

Abhishek Sharma answered Nov 06 '22 09:11


From what I could find, the predict function does not accept a DataFrame (or a sparse matrix) as input. This is a reported bug, which can be found here: https://github.com/dmlc/xgboost/issues/1238

To get around this issue, use the as_matrix() function for a DataFrame (to_numpy() or .values in pandas 1.0 and later, where as_matrix() was removed) or toarray() for a sparse matrix.

This is the only workaround until the bug is fixed or the feature is implemented in a different manner.

Sujay S Kumar answered Nov 06 '22 08:11


I came across the same problem and solved it by applying the training DataFrame's column order to the test DataFrame with the following code:

test_df = test_df[train_df.columns]
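
This line raises a KeyError if test_df lacks any of the training columns, so it can help to see exactly how the two frames differ before reordering. A hedged sketch of such a check; column_diff is a hypothetical helper written for this example, not part of pandas or XGBoost, and the frames are made up.

```python
import pandas as pd


def column_diff(train_df, test_df):
    """Report columns that appear in only one frame and whether the order matches."""
    train_cols = list(train_df.columns)
    test_cols = list(test_df.columns)
    return {
        "missing": [c for c in train_cols if c not in test_cols],
        "unexpected": [c for c in test_cols if c not in train_cols],
        "same_order": train_cols == test_cols,
    }


# Made-up frames: "b" is absent from the test data and "d" is new.
train_df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
test_df = pd.DataFrame({"c": [3], "a": [1], "d": [4]})

report = column_diff(train_df, test_df)
print(report)
```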

CathyQian answered Nov 06 '22 10:11