Just wondering how is possible next case:
def fit(self, train, target):
xgtrain = xgb.DMatrix(train, label=target, missing=np.nan)
self.model = xgb.train(self.params, xgtrain, self.num_rounds)
I passed the train dataset as csr_matrix with 5233 columns, and after converting to DMatrix I got 5322 features.
Later on predict step, I got an error as cause of above bug :(
def predict(self, test):
if not self.model:
return -1
xgtest = xgb.DMatrix(test)
return self.model.predict(xgtest)
Error: ... training data did not have the following fields: f5232
How can I guarantee correct converting my train/test datasets to DMatrix?
Are there any chance to use in Python something similar to R?
# get same columns for test/train sparse matrixes
col_order <- intersect(colnames(X_train_sparse), colnames(X_test_sparse))
X_train_sparse <- X_train_sparse[,col_order]
X_test_sparse <- X_test_sparse[,col_order]
My approach doesn't work, unfortunately:
def _normalize_columns(self):
columns = (set(self.xgtest.feature_names) - set(self.xgtrain.feature_names)) | \
(set(self.xgtrain.feature_names) - set(self.xgtest.feature_names))
for item in columns:
if item in self.xgtest.feature_names:
self.xgtest.feature_names.remove(item)
else:
# seems, it's immutable structure and can not add any new item!!!
self.xgtest.feature_names.append(item)
One another possibility is to have one feature level exclusively in training data not in testing data. This situation happens mostly while post one hot encoding whose resultant is big matrix have level for each level of categorical features. In your case it looks like "f5232" is either exclusive in training or test data. If either case model scoring likely to throw error (in most implementations of ML packages) because:
Solutions:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With