Why does XGBoost with datasets of zeros return a non-zero prediction?

I recently developed a fully functioning random forest regression application with scikit-learn's RandomForestRegressor, and now I'm interested in comparing its performance with other libraries. I found the scikit-learn API for XGBoost random forest regression (XGBRFRegressor) and wrote a small test in which both the X features and the y targets are all zeros.

from numpy import array
from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor

# Identical hyperparameters for both models
tree_number = 100
depth = 10
jobs = 1
dimension = 19

sk_VAL = RandomForestRegressor(n_estimators=tree_number, max_depth=depth, random_state=42,
                               n_jobs=jobs)
xgb_VAL = XGBRFRegressor(n_estimators=tree_number, max_depth=depth, random_state=42,
                         n_jobs=jobs)

# Two training samples of all zeros, with all-zero targets
dataset = array([[0.0] * dimension, [0.0] * dimension])
y_val = array([0.0, 0.0])

sk_VAL.fit(dataset, y_val)
xgb_VAL.fit(dataset, y_val)

# Predict on a single all-zero sample with each model
sk_predict = sk_VAL.predict(array([[0.0] * dimension]))
xgb_predict = xgb_VAL.predict(array([[0.0] * dimension]))
print("sk_prediction = {}\nxgb_prediction = {}".format(sk_predict, xgb_predict))

Surprisingly, the xgb_VAL model returns a non-zero prediction for an input sample of all zeros:

sk_prediction = [0.]
xgb_prediction = [0.02500369]

What is wrong with my evaluation, or with the way I constructed the comparison, that would explain this result?

asked Apr 16 '21 by gwanim

1 Answer

It seems that XGBoost includes a global bias in the model, and that this bias is fixed at 0.5 rather than being calculated from the input data. This has been raised as an issue in the XGBoost GitHub repository (see https://github.com/dmlc/xgboost/issues/799). The corresponding hyperparameter is base_score; if you set it to zero, your model will predict zero as expected.

from numpy import array
from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor

tree_number = 100
depth = 10
jobs = 1
dimension = 19

sk_VAL = RandomForestRegressor(n_estimators=tree_number, max_depth=depth, random_state=42, n_jobs=jobs)
# base_score=0 removes XGBoost's default global bias of 0.5
xgb_VAL = XGBRFRegressor(n_estimators=tree_number, max_depth=depth, base_score=0, random_state=42, n_jobs=jobs)

dataset = array([[0.0] * dimension, [0.0] * dimension])
y_val = array([0.0, 0.0])

sk_VAL.fit(dataset, y_val)
xgb_VAL.fit(dataset, y_val)

sk_predict = sk_VAL.predict(array([[0.0] * dimension]))
xgb_predict = xgb_VAL.predict(array([[0.0] * dimension]))

print("sk_prediction = {}\nxgb_prediction = {}".format(sk_predict, xgb_predict))
# sk_prediction = [0.]
# xgb_prediction = [0.]
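
If you want to verify which base_score a fitted model actually used, one way (a minimal sketch, assuming a recent xgboost version where Booster.save_config() is available) is to inspect the booster's saved configuration:

import json

# Read the global bias (base_score) recorded in the fitted booster's configuration.
# Booster.save_config() returns the internal parameters as a JSON string.
config = json.loads(xgb_VAL.get_booster().save_config())
print(config["learner"]["learner_model_param"]["base_score"])
# With base_score=0 this prints a zero value; with the default it corresponds to 0.5.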
answered Nov 15 '22 by Flavia Giammarino