XGBoost

Question

I am trying to use XGBoost to model claims frequency on data generated from exposure periods of unequal length, but have been unable to get the model to treat the exposure correctly. I would normally do this by setting log(exposure) as an offset. Is this possible in XGBoost?

(A similar question was posted here: xgboost, offset exposure?)

To illustrate the issue, the R code below generates some data with the fields:

x1, x2 - factors (either 0 or 1)
exposure - length of policy period on observed data
frequency - mean number of claims per unit exposure
claims - number of observed claims ~Poisson(frequency*exposure)

The goal is to predict frequency using x1 and x2 - the true model is: frequency = 2 if x1 = x2 = 1, frequency = 1 otherwise.

Exposure can't be used to predict the frequency as it is not known at the outset of a policy. The only way we can use it is to say: expected number of claims = frequency * exposure.

The code tries to predict this using XGBoost by:

1. Setting exposure as a weight in the model matrix
2. Setting log(exposure) as an offset

Below these, I've shown how I would handle the situation for a tree (rpart) or gbm.

set.seed(1)
size<-10000
d <- data.frame(
  x1 = sample(c(0,1),size,replace=T,prob=c(0.5,0.5)),
  x2 = sample(c(0,1),size,replace=T,prob=c(0.5,0.5)),
  exposure = runif(size, 1, 10)*0.3
)
d$frequency <- 2^(d$x1==1 & d$x2==1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)

#### Try to fit using XGBoost
require(xgboost)
param0 <- list(
  "objective"  = "count:poisson"
  , "eval_metric" = "poisson-nloglik"  # Poisson negative log-likelihood; "logloss" is meant for binary labels
  , "eta" = 1
  , "subsample" = 1
  , "colsample_bytree" = 1
  , "min_child_weight" = 1
  , "max_depth" = 2
)

## 1 - set weight in xgb.Matrix

xgtrain = xgb.DMatrix(as.matrix(d[,c("x1","x2")]), label = d$claims, weight = d$exposure)
xgb = xgb.train(
  nrounds = 1
  , params = param0
  , data = xgtrain
)

d$XGB_P_1 <- predict(xgb, xgtrain)

## 2 - set as offset in xgb.Matrix
xgtrain.mf  <- model.frame(as.formula("claims~x1+x2+offset(log(exposure))"),d)
xgtrain.m  <- model.matrix(attr(xgtrain.mf,"terms"),data = d)
xgtrain  <- xgb.DMatrix(xgtrain.m,label = d$claims)

xgb = xgb.train(
  nrounds = 1
  , params = param0
  , data = xgtrain
)

d$XGB_P_2 <- predict(xgb, xgtrain)  # the fitted object is named xgb, not model

#### Fit a tree
require(rpart)
# For method = "poisson", rpart takes a two-column response: (exposure time, event count)
tree <- rpart(cbind(exposure, claims) ~ x1 + x2,
              data = d,
              method = "poisson")

d$Tree_F <- predict(tree, newdata = d)

#### Fit a GBM
require(gbm)

gbm <- gbm(claims~x1+x2+offset(log(exposure)), 
           data = d,
           distribution = "poisson",
           n.trees = 1,
           shrinkage=1,
           interaction.depth=2,
           bag.fraction = 0.5)

d$GBM_F <- predict(gbm, newdata = d, n.trees = 1, type="response")

Vinh Nguyen · Accepted Answer

At least with the glm function in R, modeling count ~ x1 + x2 + offset(log(exposure)) with family=poisson(link='log') is equivalent to modeling I(count/exposure) ~ x1 + x2 with family=poisson(link='log') and weights=exposure. That is, normalize your count by exposure to get frequency, and model frequency with exposure as the weight. Your estimated coefficients should be the same in both cases when using glm for Poisson regression. Try it for yourself using a sample data set.
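The equivalence described above can be checked with a small sketch in base R. The simulated data follows the question's setup; the object names fit_offset and fit_weighted are only illustrative.

```r
# Simulated data with the same structure as in the question
set.seed(1)
size <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), size, replace = TRUE),
  x2 = sample(c(0, 1), size, replace = TRUE),
  exposure = runif(size, 1, 10) * 0.3
)
d$frequency <- 2^(d$x1 == 1 & d$x2 == 1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)

# (a) Claim counts as the response, log(exposure) as an offset
fit_offset <- glm(claims ~ x1 + x2 + offset(log(exposure)),
                  family = poisson(link = "log"), data = d)

# (b) Observed frequency as the response, exposure as the weight.
# glm warns about non-integer responses here, so the warning is suppressed.
fit_weighted <- suppressWarnings(
  glm(I(claims / exposure) ~ x1 + x2,
      family = poisson(link = "log"), weights = exposure, data = d)
)

# Put the two sets of coefficients side by side for comparison
print(cbind(offset = coef(fit_offset), weighted = coef(fit_weighted)))
```

Both fits solve the same score equations, since exposure * (claims / exposure - mu) equals claims - exposure * mu, so the coefficients should agree up to numerical tolerance.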

I'm not exactly sure what objective='count:poisson' corresponds to, but I would expect setting your target variable as frequency (count/exposure) and using exposure as the weight in xgboost would be the way to go when exposures are varying.
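A minimal sketch of that suggestion in xgboost, assuming the R package's xgb.DMatrix, xgb.train and predict interfaces as used in the question. The parameter values (eta, max_depth, nrounds) and the names dtrain_w and fit_w are illustrative choices, not tuned settings. Note the difference from the question's first attempt, which kept the raw claim count as the label: here the label is claims / exposure.

```r
library(xgboost)

# Simulated data with the same structure as in the question
set.seed(1)
size <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), size, replace = TRUE),
  x2 = sample(c(0, 1), size, replace = TRUE),
  exposure = runif(size, 1, 10) * 0.3
)
d$frequency <- 2^(d$x1 == 1 & d$x2 == 1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)

X <- as.matrix(d[, c("x1", "x2")])

# Label is the observed frequency; exposure enters as the observation weight
dtrain_w <- xgb.DMatrix(X, label = d$claims / d$exposure, weight = d$exposure)

# Illustrative settings (assumptions, not recommendations)
params <- list(objective = "count:poisson", eta = 0.3, max_depth = 2)
fit_w <- xgb.train(params = params, data = dtrain_w, nrounds = 50)

# Predictions are on the frequency scale (claims per unit exposure)
d$freq_hat <- predict(fit_w, dtrain_w)
print(aggregate(freq_hat ~ x1 + x2, data = d, FUN = mean))
```

With a Poisson objective, the weighted gradient exposure * (exp(f) - claims / exposure) is the same as the gradient of the offset formulation, exposure * exp(f) - claims, which is why this setup is expected to behave like an offset model.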

Pete Lowth · Answer

I have now worked out how to do this: use setinfo to set the base_margin attribute of the xgb.DMatrix to the offset (on the linear predictor scale, so log(exposure) for a log link), i.e.:

setinfo(xgtrain, "base_margin", log(d$exposure))
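For context, here is a fuller sketch of how the base_margin approach might be used end to end, assuming the same simulated data as in the question. The parameter values and the names dtrain, dunit, claims_hat and freq_hat are illustrative. The step of predicting with a zero base_margin to recover frequency per unit exposure is an addition here, not part of the original answer.

```r
library(xgboost)

# Simulated data with the same structure as in the question
set.seed(1)
size <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), size, replace = TRUE),
  x2 = sample(c(0, 1), size, replace = TRUE),
  exposure = runif(size, 1, 10) * 0.3
)
d$frequency <- 2^(d$x1 == 1 & d$x2 == 1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)

X <- as.matrix(d[, c("x1", "x2")])

# Label is the raw claim count; the offset goes in as base_margin (log scale)
dtrain <- xgb.DMatrix(X, label = d$claims)
setinfo(dtrain, "base_margin", log(d$exposure))

# Illustrative settings (assumptions, not recommendations)
params <- list(objective = "count:poisson", eta = 0.3, max_depth = 2)
fit <- xgb.train(params = params, data = dtrain, nrounds = 50)

# With the training offsets attached, predictions are expected claim counts
d$claims_hat <- predict(fit, dtrain)

# With a zero offset (log(1) = 0), predictions are claims per unit exposure
dunit <- xgb.DMatrix(X)
setinfo(dunit, "base_margin", rep(0, nrow(X)))
d$freq_hat <- predict(fit, dunit)

print(aggregate(freq_hat ~ x1 + x2, data = d, FUN = mean))
```

Setting base_margin explicitly on the prediction matrix matters: without it, xgboost falls back to its base_score as the starting margin, so the predictions would not be on a per-unit-exposure scale.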

XGBoost - Poisson distribution with varying exposure / offset

Tags: r, offset, xgboost, poisson
