I'm trying to implement a boosted Poisson regression model in xgboost, but I am finding the results are biased at low frequencies. To illustrate, here is some minimal Python code that I think replicates the issue:
import numpy as np
import pandas as pd
import xgboost as xgb

def get_preds(mult):
    # generate toy dataset for illustration
    # 4 observations with linearly increasing frequencies
    # the frequencies are scaled by `mult`
    dmat = xgb.DMatrix(data=np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                       label=[i*mult for i in [1, 2, 3, 4]],
                       weight=[1000, 1000, 1000, 1000])
    # train a poisson booster on the toy data
    bst = xgb.train(
        params={"objective": "count:poisson"},
        dtrain=dmat,
        num_boost_round=100000,
        early_stopping_rounds=5,
        evals=[(dmat, "train")],
        verbose_eval=False)
    # return fitted frequencies after reversing scaling
    return bst.predict(dmat)/mult

# test multipliers in the range [10**(-8), 10**0]
# display fitted frequencies
mults = [10**i for i in range(-8, 1)]
df = pd.DataFrame(np.round(np.vstack([get_preds(m) for m in mults]), 0))
df.index = mults
df.columns = ["(0, 0)", "(0, 1)", "(1, 0)", "(1, 1)"]
df

# --- result ---
#                (0, 0)   (0, 1)   (1, 0)   (1, 1)
# 1.000000e-08  11598.0  11598.0  11598.0  11598.0
# 1.000000e-07   1161.0   1161.0   1161.0   1161.0
# 1.000000e-06    118.0    118.0    118.0    118.0
# 1.000000e-05     12.0     12.0     12.0     12.0
# 1.000000e-04      2.0      2.0      3.0      3.0
# 1.000000e-03      1.0      2.0      3.0      4.0
# 1.000000e-02      1.0      2.0      3.0      4.0
# 1.000000e-01      1.0      2.0      3.0      4.0
# 1.000000e+00      1.0      2.0      3.0      4.0
Notice that at low frequencies the predictions blow up. This may have something to do with the product of the Poisson lambda and the weight dropping below 1 (and indeed, increasing the weight above 1000 does shift the blow-up to lower frequencies), but even so I would expect the predictions to approach the mean training frequency (2.5). Also (not shown in the example above), reducing eta seems to increase the amount of bias in the predictions.
What would cause this to happen? Is a parameter available that would mitigate the effect?
After some digging, I found a solution. Documenting it here in case someone else runs into the same issue. It turns out I needed to add an offset term equal to the natural log of the mean frequency. If that isn't immediately obvious, it's because the initial prediction starts at a frequency of 0.5, so many boosting iterations are spent just rescaling the predictions toward the average frequency.
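To put a rough number on that rescaling cost: with the default initial prediction of 0.5, training starts at a log-link margin of log(0.5), while the tiniest labels here need a margin near log(2.5e-8). A quick back-of-envelope sketch (pure numpy; the 2.5e-8 target is just the mean label from the toy example at mult = 1e-8):

```python
import numpy as np

# default initial prediction is 0.5, i.e. a starting margin of log(0.5)
start_margin = np.log(0.5)

# at mult = 1e-8 the mean label is 2.5e-8, i.e. a target margin of log(2.5e-8)
target_margin = np.log(2.5e-8)

# the booster has to walk this whole gap in margin space before it can
# even begin fitting the differences between the four observations
gap = start_margin - target_margin
print(gap)  # roughly 16.8
```

Covering a margin gap of ~17 in small eta-sized increments eats a lot of boosting rounds, which is why starting the margin at the log of the mean frequency helps so much.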
The code below updates the toy example. As I suggested in the original question, the predictions now approach the mean frequency (2.5) at the lower scales.
import numpy as np
import pandas as pd
import xgboost as xgb

def get_preds(mult):
    # generate toy dataset for illustration
    # 4 observations with linearly increasing frequencies
    # the frequencies are scaled by `mult`
    dmat = xgb.DMatrix(data=np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                       label=[i*mult for i in [1, 2, 3, 4]],
                       weight=[1000, 1000, 1000, 1000])
    ## add an offset term equal to the log of the mean frequency
    offset = np.log(np.mean([i*mult for i in [1, 2, 3, 4]]))
    dmat.set_base_margin(np.repeat(offset, 4))
    # train a poisson booster on the toy data
    bst = xgb.train(
        params={"objective": "count:poisson"},
        dtrain=dmat,
        num_boost_round=100000,
        early_stopping_rounds=5,
        evals=[(dmat, "train")],
        verbose_eval=False)
    # return fitted frequencies after reversing scaling
    return bst.predict(dmat)/mult

# test multipliers in the range [10**(-8), 10**0]
# display fitted frequencies
mults = [10**i for i in range(-8, 1)]
## round to 1 decimal place to show the result approaches 2.5
df = pd.DataFrame(np.round(np.vstack([get_preds(m) for m in mults]), 1))
df.index = mults
df.columns = ["(0, 0)", "(0, 1)", "(1, 0)", "(1, 1)"]
df

# --- result ---
#               (0, 0)  (0, 1)  (1, 0)  (1, 1)
# 1.000000e-08     2.5     2.5     2.5     2.5
# 1.000000e-07     2.5     2.5     2.5     2.5
# 1.000000e-06     2.5     2.5     2.5     2.5
# 1.000000e-05     2.5     2.5     2.5     2.5
# 1.000000e-04     2.4     2.5     2.5     2.6
# 1.000000e-03     1.0     2.0     3.0     4.0
# 1.000000e-02     1.0     2.0     3.0     4.0
# 1.000000e-01     1.0     2.0     3.0     4.0
# 1.000000e+00     1.0     2.0     3.0     4.0