I'm trying to implement a boosted Poisson regression model in xgboost, but I am finding the results are biased at low frequencies. To illustrate, here is some minimal Python code that I think replicates the issue:
import numpy as np
import pandas as pd
import xgboost as xgb

def get_preds(mult):
    # generate toy dataset for illustration
    # 4 observations with linearly increasing frequencies
    # the frequencies are scaled by `mult`
    dmat = xgb.DMatrix(data=np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                       label=[i*mult for i in [1, 2, 3, 4]],
                       weight=[1000, 1000, 1000, 1000])
    # train a poisson booster on the toy data
    bst = xgb.train(
        params={"objective": "count:poisson"},
        dtrain=dmat,
        num_boost_round=100000,
        early_stopping_rounds=5,
        evals=[(dmat, "train")],
        verbose_eval=False)
    # return fitted frequencies after reversing scaling
    return bst.predict(dmat)/mult

# test multipliers in the range [10**(-8), 10**0]
# display fitted frequencies
mults = [10**i for i in range(-8, 1)]
df = pd.DataFrame(np.round(np.vstack([get_preds(m) for m in mults]), 0))
df.index = mults
df.columns = ["(0, 0)", "(0, 1)", "(1, 0)", "(1, 1)"]
df

# --- result ---
#                (0, 0)   (0, 1)   (1, 0)   (1, 1)
# 1.000000e-08  11598.0  11598.0  11598.0  11598.0
# 1.000000e-07   1161.0   1161.0   1161.0   1161.0
# 1.000000e-06    118.0    118.0    118.0    118.0
# 1.000000e-05     12.0     12.0     12.0     12.0
# 1.000000e-04      2.0      2.0      3.0      3.0
# 1.000000e-03      1.0      2.0      3.0      4.0
# 1.000000e-02      1.0      2.0      3.0      4.0
# 1.000000e-01      1.0      2.0      3.0      4.0
# 1.000000e+00      1.0      2.0      3.0      4.0
Notice that at low frequencies the predictions blow up. This may have something to do with the product of the Poisson lambda and the weight dropping below 1 (and indeed, increasing the weight above 1000 does shift the blow-up to lower frequencies), but even so I would expect the predictions to approach the mean training frequency (2.5). Also (not shown in the example above), reducing eta seems to increase the amount of bias in the predictions.
What would cause this to happen? Is a parameter available that would mitigate the effect?
After some digging, I found a solution. Documenting it here in case someone else runs into the same issue. It turns out I needed to add an offset term equal to the natural log of the mean frequency. If that isn't immediately obvious, it's because the initial prediction starts at a frequency of 0.5, so many boosting iterations are spent just rescaling the predictions toward the average frequency.
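To put a rough number on that rescaling cost: with the default initial prediction of 0.5, training starts at a log-link margin of log(0.5), while the tiniest labels here need a margin near log(2.5e-8). A quick back-of-envelope sketch (pure numpy; the 2.5e-8 target is just the mean label from the toy example at mult = 1e-8):

```python
import numpy as np

# default initial prediction is 0.5, i.e. a starting margin of log(0.5)
start_margin = np.log(0.5)

# at mult = 1e-8 the mean label is 2.5e-8, i.e. a target margin of log(2.5e-8)
target_margin = np.log(2.5e-8)

# the booster has to walk this whole gap in margin space before it can
# even begin fitting the differences between the four observations
gap = start_margin - target_margin
print(gap)  # roughly 16.8
```

Covering a margin gap of ~17 in small eta-sized increments eats a lot of boosting rounds, which is why starting the margin at the log of the mean frequency helps so much.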
The code below updates the toy example. As I suggested in the original question, the predictions now approach the mean frequency (2.5) at the lower scales.
import numpy as np
import pandas as pd
import xgboost as xgb

def get_preds(mult):
    # generate toy dataset for illustration
    # 4 observations with linearly increasing frequencies
    # the frequencies are scaled by `mult`
    dmat = xgb.DMatrix(data=np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                       label=[i*mult for i in [1, 2, 3, 4]],
                       weight=[1000, 1000, 1000, 1000])
    ## add an offset term equal to the log of the mean frequency
    offset = np.log(np.mean([i*mult for i in [1, 2, 3, 4]]))
    dmat.set_base_margin(np.repeat(offset, 4))
    # train a poisson booster on the toy data
    bst = xgb.train(
        params={"objective": "count:poisson"},
        dtrain=dmat,
        num_boost_round=100000,
        early_stopping_rounds=5,
        evals=[(dmat, "train")],
        verbose_eval=False)
    # return fitted frequencies after reversing scaling
    return bst.predict(dmat)/mult

# test multipliers in the range [10**(-8), 10**0]
# display fitted frequencies
mults = [10**i for i in range(-8, 1)]
## round to 1 decimal place to show the result approaches 2.5
df = pd.DataFrame(np.round(np.vstack([get_preds(m) for m in mults]), 1))
df.index = mults
df.columns = ["(0, 0)", "(0, 1)", "(1, 0)", "(1, 1)"]
df

# --- result ---
#               (0, 0)  (0, 1)  (1, 0)  (1, 1)
# 1.000000e-08     2.5     2.5     2.5     2.5
# 1.000000e-07     2.5     2.5     2.5     2.5
# 1.000000e-06     2.5     2.5     2.5     2.5
# 1.000000e-05     2.5     2.5     2.5     2.5
# 1.000000e-04     2.4     2.5     2.5     2.6
# 1.000000e-03     1.0     2.0     3.0     4.0
# 1.000000e-02     1.0     2.0     3.0     4.0
# 1.000000e-01     1.0     2.0     3.0     4.0
# 1.000000e+00     1.0     2.0     3.0     4.0