
XGBoost difference in train and test features after converting to DMatrix

I'm wondering how the following case is possible:

 def fit(self, train, target):
     xgtrain = xgb.DMatrix(train, label=target, missing=np.nan)
     self.model = xgb.train(self.params, xgtrain, self.num_rounds)

I passed the training dataset as a csr_matrix with 5233 columns, but after converting to DMatrix I got 5322 features.

Later, at the predict step, I get an error caused by this mismatch :(

 def predict(self, test):
     if not self.model:
         return -1
     xgtest = xgb.DMatrix(test)
     return self.model.predict(xgtest)


Error: ... training data did not have the following fields: f5232

How can I guarantee that my train/test datasets are converted to DMatrix correctly?

Is there any way to do in Python something similar to this R code?

# get same columns for test/train sparse matrices
col_order <- intersect(colnames(X_train_sparse), colnames(X_test_sparse))
X_train_sparse <- X_train_sparse[,col_order]
X_test_sparse <- X_test_sparse[,col_order]

My approach doesn't work, unfortunately:

def _normalize_columns(self):
    columns = (set(self.xgtest.feature_names) - set(self.xgtrain.feature_names)) | \
              (set(self.xgtrain.feature_names) - set(self.xgtest.feature_names))
    for item in columns:
        if item in self.xgtest.feature_names:
            self.xgtest.feature_names.remove(item)
        else:
            # feature_names seems to be an immutable structure; new items cannot be added!
            self.xgtest.feature_names.append(item)
asked Mar 12 '23 by SpanishBoy
1 Answer

Another possibility is that a feature level appears exclusively in the training data and not in the test data (or vice versa). This happens most often after one-hot encoding, which produces a large matrix with one column per level of each categorical feature. In your case it looks like "f5232" exists exclusively in either the training or the test data. In either case, model scoring will likely throw an error (in most implementations of ML packages), because:

  1. If the feature is exclusive to training: the model object will reference it in the model equation, and scoring will fail with an error saying it cannot find that column.
  2. If the feature is exclusive to test (less likely, since test data is usually smaller than training data): the model object will not reference it, and scoring will complain about receiving a column the model equation doesn't have. This is also less likely because most implementations are cognizant of this case.

Solutions:

  1. The best "automated" solution is to keep only the columns that are common to both the training and test data after one-hot encoding.
  2. For ad-hoc analysis, if you cannot afford to drop a feature level because of its importance, use stratified sampling to ensure that every level of the feature is distributed across both training and test data.
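Solution 1 can be sketched in Python with pandas; the frames and column names below are illustrative placeholders, and `get_dummies` stands in for whatever one-hot encoding step your pipeline uses:

```python
import pandas as pd

# Toy data: "green" appears only in train, "blue" only in test.
train = pd.DataFrame({"color": ["red", "green", "red"], "size": [1, 2, 3]})
test = pd.DataFrame({"color": ["red", "blue"], "size": [4, 5]})

train_enc = pd.get_dummies(train)  # columns: size, color_green, color_red
test_enc = pd.get_dummies(test)    # columns: size, color_blue, color_red

# Keep only the columns common to both, in the same order --
# the Python analogue of the R intersect() snippet in the question.
common = train_enc.columns.intersection(test_enc.columns)
train_enc = train_enc[common]
test_enc = test_enc[common]
```

Because both frames now have identical columns in identical order, `xgb.DMatrix(train_enc)` and `xgb.DMatrix(test_enc)` will agree on feature names and prediction will no longer complain about missing fields.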
answered Mar 19 '23 by abhiieor