
xgboost predict method returns the same predicted value for all rows

I've created an xgboost classifier in Python:

train is a pandas dataframe with 100k rows and 50 features as columns. target is a pandas series

import xgboost as xgb

xgb_classifier = xgb.XGBClassifier(nthread=-1, max_depth=3, silent=0,
                                   objective='reg:linear', n_estimators=100)
xgb_classifier = xgb_classifier.fit(train, target)

predictions = xgb_classifier.predict(test)

However, after training, when I use this classifier to predict values, the entire results array is the same number. Any idea why this is happening?

Data clarification: ~50 numerical features with a numerical target

I've also tried RandomForestRegressor from sklearn with the same data, and it gives realistic predictions. Could this be a legitimate bug in the xgboost implementation?
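For reference, a minimal sketch of that comparison (assuming the same train, target and test objects as above; the RandomForestRegressor settings here are placeholders):

from sklearn.ensemble import RandomForestRegressor

# Same data, different model: this comparison produced varied,
# realistic predictions, unlike the xgboost model above.
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(train, target)
rf_predictions = rf.predict(test)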

asked Nov 02 '15 by mistakeNot

People also ask

How does XGBoost predict?

There are 2 predictors in XGBoost (3 if you have the one-api plugin enabled), namely cpu_predictor and gpu_predictor. The default option is auto, so that XGBoost can employ some heuristics for saving GPU memory during training. They might produce slightly different outputs due to floating-point errors.
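As a rough sketch, the predictor can be pinned explicitly via the low-level xgb.train API (using the question's train/target data; note that recent xgboost releases replace the predictor parameter with device='cpu'/'cuda'):

import xgboost as xgb

# Pin the predictor explicitly instead of relying on the default 'auto'.
# reg:squarederror is the current name of the reg:linear objective.
params = {'objective': 'reg:squarederror', 'predictor': 'cpu_predictor'}
booster = xgb.train(params, xgb.DMatrix(train, label=target), num_boost_round=50)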

What is the output of XGBoost?

Output is a 4-dim array with (rows, groups, columns + 1, columns + 1) as its shape. As in the predict-contributions case, whether approx_contribs is used does not change the output shape. If strict_shape is set to False, it can have 3 or 4 dims depending on the underlying model.
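That 4-dim shape describes predict(..., pred_interactions=True); a small sketch, reusing the booster trained in the sketch above:

# Feature-interaction contributions; with strict_shape=True the result is
# always 4-dimensional: (rows, groups, columns + 1, columns + 1).
interactions = booster.predict(xgb.DMatrix(train), pred_interactions=True,
                               strict_shape=True)
print(interactions.shape)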

What is the initial prediction for all observations in XGBoost?

XGBoost starts by making the same initial prediction (the base_score parameter, 0.5 by default) for every sample; each subsequent tree then corrects the residuals of that prediction.
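A minimal sketch of overriding that initial prediction, assuming the numerical target from the question:

import xgboost as xgb

# Start boosting from the mean of the target instead of the default 0.5.
reg = xgb.XGBRegressor(n_estimators=100, max_depth=3,
                       base_score=float(target.mean()))
reg.fit(train, target)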

What is DMatrix in XGBoost?

DMatrix is an internal data structure used by XGBoost, optimized for both memory efficiency and training speed. You can construct a DMatrix from many different data sources, such as a file path (os.PathLike/string), a NumPy array, or a pandas DataFrame.
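For illustration, a minimal sketch of building a DMatrix from the question's pandas objects and training with the low-level API:

import xgboost as xgb

# Wrap the pandas objects in a DMatrix and train via xgb.train.
dtrain = xgb.DMatrix(train, label=target)
params = {'objective': 'reg:squarederror', 'max_depth': 3}
bst = xgb.train(params, dtrain, num_boost_round=100)
preds = bst.predict(xgb.DMatrix(test))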


2 Answers

This question has received several responses, both in this thread and elsewhere.

I was having a similar issue with both XGBoost and LGBM. For me, the solution was to increase the size of the training dataset.

I was training on a local machine using a random sample (~0.5%) of a large sparse dataset (200,000 rows and 7,000 columns) because I did not have enough local memory for the algorithm. It turned out that the array of predicted values was just an array of the average value of the target variable, which suggests the model was underfitting. One solution to an underfitting model is to train it on more data, so I reran the analysis on a machine with more memory and the issue was resolved: the prediction array was no longer an array of average target values. On the other hand, the issue could simply have been that the slice of predicted values I was looking at was predicted from training data with very little information (e.g. mostly 0's and NaN's). For training data with very little information, predicting the average value of the target feature is a reasonable outcome.
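A quick way to check for this failure mode (names follow the question; the tolerance is arbitrary):

import numpy as np

# If every prediction is (numerically) the mean of the training target,
# the model has effectively learned nothing beyond the base prediction.
print(np.unique(predictions))
print(float(target.mean()))
print('collapsed to target mean:',
      np.allclose(predictions, target.mean(), atol=1e-3))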

None of the other suggested solutions I came across were helpful for me. To summarize, the suggestions included: 1) check whether gamma is too high; 2) make sure your target labels are not included in your training dataset; 3) check whether max_depth is too small. (A rough sketch of these checks follows below.)
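A rough sketch of those checks, assuming the scikit-learn-style API used in the question (the parameter values are placeholders):

import xgboost as xgb

# 1) Lower gamma (less pruning) and 3) allow deeper trees, then retrain.
model = xgb.XGBRegressor(gamma=0, max_depth=6, n_estimators=100)
model.fit(train, target)

# 2) Make sure the target is not also present as a feature column.
leaked = [c for c in train.columns if train[c].equals(target)]
print('columns identical to the target:', leaked)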

answered Oct 20 '22 by Blane


One possible reason is that you're applying too high a penalty through the gamma parameter. Compare the mean of your training response variable with the predictions and check whether they are close. If so, the model is restricting the predictions too much in order to keep train-rmse and val-rmse as close as possible. The higher the value of gamma, the simpler the prediction, so you end up with something like the mean of the training set as the prediction, i.e. a naive prediction.
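A small sketch of that check: compare the training-target mean against the predictions, and watch how the prediction spread changes as gamma is relaxed (the parameter values here are arbitrary):

import xgboost as xgb

print('training target mean:', float(target.mean()))

# With a large gamma the trees are pruned back to almost nothing and the
# predictions collapse toward a constant; lowering gamma restores variance.
for gamma in [10.0, 1.0, 0.0]:
    model = xgb.XGBRegressor(gamma=gamma, max_depth=3, n_estimators=100)
    model.fit(train, target)
    preds = model.predict(test)
    print(f'gamma={gamma}: prediction std = {preds.std():.4f}')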

answered Oct 20 '22 by Shahidur