 

feature_names mismatch in XGBoost despite having the same columns

Tags:

python

xgboost

I have a training set (X) and a test set (test_data_process) with the same columns in the same order, as shown below:

(screenshot of the two DataFrames, showing identical columns in the same order)

But when I do

predictions = my_model.predict(test_data_process)    

It gives the following error:

ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34'] ['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'YrMoSold'] expected f22, f25, f0, f34, f32, f5, f20, f3, f33, f15, f24, f31, f28, f9, f8, f19, f14, f18, f17, f2, f13, f4, f27, f16, f1, f29, f11, f26, f10, f7, f21, f30, f23, f6, f12 in input data training data did not have the following fields: OpenPorchSF, BsmtFinSF1, LotFrontage, GrLivArea, YrMoSold, FullBath, TotRmsAbvGrd, GarageCars, YearRemodAdd, BedroomAbvGr, PoolArea, KitchenAbvGr, LotArea, HalfBath, MiscVal, EnclosedPorch, BsmtUnfSF, MSSubClass, BsmtFullBath, YearBuilt, 1stFlrSF, ScreenPorch, 3SsnPorch, TotalBsmtSF, GarageYrBlt, MasVnrArea, OverallQual, Fireplaces, WoodDeckSF, 2ndFlrSF, BsmtFinSF2, BsmtHalfBath, LowQualFinSF, OverallCond, GarageArea

So it complains that the training data (X) does not have those fields, even though it does.

How can I solve this issue?

[UPDATE]:

My code:

X = data.select_dtypes(exclude=['object']).drop(columns=['Id'])
X['YrMoSold'] = X['YrSold'] * 12 + X['MoSold']
X = X.drop(columns=['YrSold', 'MoSold', 'SalePrice'])
X = X.fillna(0.0000001)

train_X, val_X, train_y, val_y = train_test_split(X.values, y.values, test_size=0.2)

my_model = XGBRegressor(n_estimators=100, learning_rate=0.05, booster='gbtree')
my_model.fit(train_X, train_y, early_stopping_rounds=5, 
    eval_set=[(val_X, val_y)], verbose=False)

test_data_process = test_data.select_dtypes(exclude=['object']).drop(columns=['Id'])
test_data_process['YrMoSold'] = test_data_process['YrSold'] * 12 + test_data['MoSold']
test_data_process = test_data_process.drop(columns=['YrSold', 'MoSold'])
test_data_process = test_data_process.fillna(0.0000001)
test_data_process = test_data_process[X.columns]

predictions = my_model.predict(test_data_process)    
asked Sep 30 '18 by rcs


2 Answers

That's an easy mistake to make.

When fitting the model, you are feeding it NumPy arrays:

train_X, val_X, train_y, val_y = train_test_split(X.values, y.values, test_size=0.2)

(X.values is a NumPy array)

NumPy arrays have no column names, so XGBoost assigns generic ones (f0, f1, ...).

At prediction time, however, you are passing a DataFrame, whose real column names no longer match those generic names.

Convert it to a NumPy array as well:

predictions = my_model.predict(test_data_process.values)  

(add .values)
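Whichever representation you choose, use it consistently for both fit and predict. A minimal pandas-only sketch of the two consistent options (the frames and column names here are made up for illustration, and the model calls are shown as comments):

```python
import numpy as np
import pandas as pd

# Toy frames standing in for X and test_data_process
X = pd.DataFrame({'LotArea': [8450, 9600, 11250], 'YearBuilt': [2003, 1976, 2001]})
test_data_process = pd.DataFrame({'YearBuilt': [1915], 'LotArea': [9550]})

# Option 1: NumPy everywhere -- the model only ever sees generic names (f0, f1, ...)
train_arr = X.values                             # np.ndarray, column names dropped
test_arr = test_data_process[X.columns].values   # align column order BEFORE dropping names
# my_model.fit(train_arr, train_y); my_model.predict(test_arr)

# Option 2: DataFrames everywhere -- the model records the real column names
test_aligned = test_data_process[X.columns]      # enforce the same column order
# my_model.fit(X, train_y); my_model.predict(test_aligned)

print(test_arr.shape)                 # (1, 2)
print(test_aligned.columns.tolist())  # ['LotArea', 'YearBuilt']
```

Either option works; mixing them (arrays at fit time, a DataFrame at predict time) is what triggers the mismatch.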

answered Nov 13 '22 by epattaro


I also faced the same problem and spent several hours checking many Q&As on SO and GitHub. In the end, the problem was solved :). Credit goes to this response by ianozsvald, who mentioned that we have to pass a NumPy array from the start.

In my case, when I was working with XGBoost on its own (not as a base learner in a stacking classifier), there was no problem. However, when XGBoost was one of several base learners in a Stacking classifier, and I tried to call the KernelExplainer of SHAP (SHapley Additive exPlanations) to explain the stacked model, I got this error.

Here is how I solved the problem.

  1. First, I passed train_x_df.values instead of train_x_df when fitting the Stacking classifier.
  2. Second, I likewise passed train_x_df.values as the data of KernelExplainer.

In a sentence: to solve the problem, we have to use the NumPy representation of the DataFrame (obtained via the .values property) everywhere. Please note that doing only the 2nd step did not work (at least in my case); the mismatch still occurred.
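The two steps above can be sketched as follows. The StackingClassifier and shap.KernelExplainer calls are shown as comments, since the exact model setup isn't given in the answer; the point is only that the same .values conversion appears in both places:

```python
import numpy as np
import pandas as pd

# train_x_df stands in for the training DataFrame from the question
train_x_df = pd.DataFrame({'feat_a': [0.1, 0.2, 0.3], 'feat_b': [1.0, 2.0, 3.0]})

# Step 1: fit the stacking model on the NumPy representation, not the DataFrame
train_x = train_x_df.values
# stacking_clf.fit(train_x, train_y)

# Step 2: pass the SAME representation to the explainer
# explainer = shap.KernelExplainer(stacking_clf.predict_proba, train_x)
# shap_values = explainer.shap_values(train_x)

print(type(train_x).__name__)  # ndarray
```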

answered Nov 13 '22 by Md. Sabbir Ahmed