Poisson Regression in xgboost Fails for Low Frequencies

Tags: python, xgboost

I'm trying to implement a boosted Poisson regression model in xgboost, but I am finding the results are biased at low frequencies. To illustrate, here is some minimal Python code that I think replicates the issue:

import numpy as np
import pandas as pd
import xgboost as xgb

def get_preds(mult):
    # generate toy dataset for illustration
    # 4 observations with linearly increasing frequencies
    # the frequencies are scaled by `mult`
    dmat = xgb.DMatrix(data=np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                       label=[i*mult for i in [1, 2, 3, 4]],
                       weight=[1000, 1000, 1000, 1000])

    # train a poisson booster on the toy data
    bst = xgb.train(
        params={"objective": "count:poisson"},
        dtrain=dmat,
        num_boost_round=100000,
        early_stopping_rounds=5,
        evals=[(dmat, "train")],
        verbose_eval=False)

    # return fitted frequencies after reversing scaling
    return bst.predict(dmat)/mult

# test multipliers in the range [10**(-8), 10**0]
# display fitted frequencies 
mults = [10**i for i in range(-8, 1)]
df = pd.DataFrame(np.round(np.vstack([get_preds(m) for m in mults]), 0))
df.index = mults
df.columns = ["(0, 0)", "(0, 1)", "(1, 0)", "(1, 1)"]
df

# --- result ---
#               (0, 0)   (0, 1)   (1, 0)   (1, 1)
#1.000000e-08  11598.0  11598.0  11598.0  11598.0
#1.000000e-07   1161.0   1161.0   1161.0   1161.0
#1.000000e-06    118.0    118.0    118.0    118.0
#1.000000e-05     12.0     12.0     12.0     12.0
#1.000000e-04      2.0      2.0      3.0      3.0
#1.000000e-03      1.0      2.0      3.0      4.0
#1.000000e-02      1.0      2.0      3.0      4.0
#1.000000e-01      1.0      2.0      3.0      4.0
#1.000000e+00      1.0      2.0      3.0      4.0

Notice that at low frequencies the predictions seem to blow up. This may have something to do with the product of the Poisson lambda and the observation weight dropping below 1 (and in fact increasing the weight above 1000 does shift the "blow-up" to lower frequencies), but I would still expect the predictions to approach the mean training frequency (2.5). Also (not shown in the example above), reducing eta seems to increase the amount of bias in the predictions.
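To put a rough number on that, here is a back-of-envelope sketch (my own heuristic, assuming the booster works on the log scale, starts from the default base_score of 0.5, and can move the log-prediction by very roughly eta per round; the actual step is also capped by max_delta_step, which defaults to 0.7 for count:poisson):

import numpy as np

# back-of-envelope heuristic, not exact xgboost mechanics:
# distance on the log scale from the default starting prediction (0.5)
# to the smallest scaled frequency in the example (1e-8)
eta = 0.3                      # xgboost's default learning rate
start = np.log(0.5)            # log of the default base_score
target = np.log(1e-8)          # log of the smallest label above
print((start - target) / eta)  # ~59 rounds just to cover the distance

If training stalls or early stopping fires before the booster has covered that distance, the predictions are left stranded far above the true frequencies, and a smaller eta makes the distance (in rounds) even longer, which would match the extra bias I see.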

What would cause this to happen? Is a parameter available that would mitigate the effect?

asked Oct 11 '17 by Bryn Clarke



1 Answer

After some digging, I found a solution. Documenting it here in case someone else runs into the same issue. It turns out that I needed to add an offset term equal to the (natural) log of the mean frequency, set via base_margin. If that isn't immediately obvious, it's because boosting starts from an initial prediction of 0.5 (xgboost's default base_score), so many boosting iterations are needed just to rescale the predictions to the average frequency.
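As a sanity check on that explanation, here is a minimal sketch (assuming, as I understand it, that for count:poisson the margin lives on the log scale and predict(..., output_margin=True) returns the raw score including the base margin):

import numpy as np
import xgboost as xgb

# minimal check: base_margin should act as an additive offset on the log scale
dmat = xgb.DMatrix(data=np.array([[0.0], [1.0]]), label=[2.5, 2.5])
dmat.set_base_margin(np.log([2.5, 2.5]))  # offset = log of the mean frequency

bst = xgb.train(params={"objective": "count:poisson"},
                dtrain=dmat, num_boost_round=1)

raw = bst.predict(dmat, output_margin=True)          # log-scale score, offset included
print(np.allclose(bst.predict(dmat), np.exp(raw)))   # True: prediction = exp(margin)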

See the below code for an update to the toy example. As I suggested in the original question, the predictions now approach the mean frequency (2.5) at the lower scales.

import numpy as np
import pandas as pd
import xgboost as xgb

def get_preds(mult):
    # generate toy dataset for illustration
    # 4 observations with linearly increasing frequencies
    # the frequencies are scaled by `mult`
    dmat = xgb.DMatrix(data=np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                       label=[i*mult for i in [1, 2, 3, 4]],
                       weight=[1000, 1000, 1000, 1000])

    ## adding an offset term equal to the log of the mean frequency
    offset = np.log(np.mean([i*mult for i in [1, 2, 3, 4]]))
    dmat.set_base_margin(np.repeat(offset, 4))

    # train a poisson booster on the toy data
    bst = xgb.train(
        params={"objective": "count:poisson"},
        dtrain=dmat,
        num_boost_round=100000,
        early_stopping_rounds=5,
        evals=[(dmat, "train")],
        verbose_eval=False)

    # return fitted frequencies after reversing scaling
    return bst.predict(dmat)/mult

# test multipliers in the range [10**(-8), 10**0]
# display fitted frequencies 
mults = [10**i for i in range(-8, 1)]
## round to 1 decimal point to show the result approaches 2.5
df = pd.DataFrame(np.round(np.vstack([get_preds(m) for m in mults]), 1))
df.index = mults
df.columns = ["(0, 0)", "(0, 1)", "(1, 0)", "(1, 1)"]
df

# --- result ---
#              (0, 0)  (0, 1)  (1, 0)  (1, 1)
#1.000000e-08     2.5     2.5     2.5     2.5
#1.000000e-07     2.5     2.5     2.5     2.5
#1.000000e-06     2.5     2.5     2.5     2.5
#1.000000e-05     2.5     2.5     2.5     2.5
#1.000000e-04     2.4     2.5     2.5     2.6
#1.000000e-03     1.0     2.0     3.0     4.0
#1.000000e-02     1.0     2.0     3.0     4.0
#1.000000e-01     1.0     2.0     3.0     4.0
#1.000000e+00     1.0     2.0     3.0     4.0
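A follow-up note: the same trick should generalize to rows with unequal exposure, as in a classical Poisson GLM, by making the offset per-row. A hypothetical sketch (the exposure values here are made up for illustration):

import numpy as np
import xgboost as xgb

# hypothetical example: per-row exposures, offset = log(exposure)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
counts = np.array([10.0, 40.0, 90.0, 160.0])    # observed event counts
exposure = np.array([10.0, 20.0, 30.0, 40.0])   # e.g. policy-years per row

dmat = xgb.DMatrix(data=X, label=counts)
dmat.set_base_margin(np.log(exposure))  # trees now model the rate, not the count

bst = xgb.train(params={"objective": "count:poisson"},
                dtrain=dmat, num_boost_round=200)

# predictions include the offset, so divide by exposure to recover frequencies
print(np.round(bst.predict(dmat) / exposure, 1))  # should come out near [1, 2, 3, 4]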
answered Oct 23 '22 by Bryn Clarke