XGBoost - Poisson distribution with varying exposure / offset

I am trying to use XGBoost to model claims frequency on data generated from exposure periods of unequal length, but I have been unable to get the model to treat the exposure correctly. I would normally do this by setting log(exposure) as an offset. Is this possible in XGBoost?

(A similar question was posted here: xgboost, offset exposure?)

To illustrate the issue, the R code below generates some data with the fields:

  • x1, x2 - factors (either 0 or 1)
  • exposure - length of policy period on observed data
  • frequency - mean number of claims per unit exposure
  • claims - number of observed claims ~Poisson(frequency*exposure)

The goal is to predict frequency using x1 and x2 - the true model is: frequency = 2 if x1 = x2 = 1, frequency = 1 otherwise.

Exposure can't be used to predict the frequency as it is not known at the outset of a policy. The only way we can use it is to say: expected number of claims = frequency * exposure.

The code tries to predict this using XGBoost by:

  1. Setting exposure as a weight in the model matrix
  2. Setting log(exposure) as an offset

Below these, I've shown how I would handle the situation for a tree (rpart) or gbm.

set.seed(1)
size<-10000
d <- data.frame(
  x1 = sample(c(0,1),size,replace=T,prob=c(0.5,0.5)),
  x2 = sample(c(0,1),size,replace=T,prob=c(0.5,0.5)),
  exposure = runif(size, 1, 10)*0.3
)
d$frequency <- 2^(d$x1==1 & d$x2==1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)

#### Try to fit using XGBoost
require(xgboost)
param0 <- list(
  "objective"  = "count:poisson"
  , "eval_metric" = "logloss"
  , "eta" = 1
  , "subsample" = 1
  , "colsample_bytree" = 1
  , "min_child_weight" = 1
  , "max_depth" = 2
)

## 1 - set weight in xgb.Matrix

xgtrain = xgb.DMatrix(as.matrix(d[,c("x1","x2")]), label = d$claims, weight = d$exposure)
xgb = xgb.train(
  nrounds = 1
  , params = param0
  , data = xgtrain
)

d$XGB_P_1 <- predict(xgb, xgtrain)

## 2 - set as offset in xgb.Matrix
xgtrain.mf  <- model.frame(as.formula("claims~x1+x2+offset(log(exposure))"),d)
xgtrain.m  <- model.matrix(attr(xgtrain.mf,"terms"),data = d)
xgtrain  <- xgb.DMatrix(xgtrain.m,label = d$claims)

xgb = xgb.train(
  nrounds = 1
  , params = param0
  , data = xgtrain
)

d$XGB_P_2 <- predict(xgb, xgtrain)

#### Fit a tree
require(rpart)
d[,"tree_response"] <- cbind(d$exposure,d$claims)
tree <- rpart(tree_response ~ x1 + x2,
              data = d,
              method = "poisson")

d$Tree_F <- predict(tree, newdata = d)

#### Fit a GBM

require(gbm)
gbm <- gbm(claims~x1+x2+offset(log(exposure)),
           data = d,
           distribution = "poisson",
           n.trees = 1,
           shrinkage=1,
           interaction.depth=2,
           bag.fraction = 0.5)

d$GBM_F <- predict(gbm, newdata = d, n.trees = 1, type="response")

Pete Lowth asked Feb 26 '16 19:02


2 Answers

At least with the glm function in R, modeling count ~ x1 + x2 + offset(log(exposure)) with family=poisson(link='log') is equivalent to modeling I(count/exposure) ~ x1 + x2 with family=poisson(link='log') and weights=exposure. That is, divide the count by exposure to get frequency, and model frequency with exposure as the weight. The estimated coefficients are the same in both cases when using glm for Poisson regression. Try it for yourself on a sample data set.
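
A minimal sketch of that check, assuming simulated data shaped like the data in the question. The names fit_offset and fit_weight are illustrative only. The two fits maximise log-likelihoods that differ only by a term that does not involve the coefficients, so the estimates should agree up to convergence tolerance. glm warns about non-integer responses in the weighted fit, hence the suppressWarnings wrapper.

```r
# Simulated data (assumption: same shape as the data in the question)
set.seed(1)
n <- 2000
d <- data.frame(
  x1 = rbinom(n, 1, 0.5),
  x2 = rbinom(n, 1, 0.5),
  exposure = runif(n, 0.3, 3)
)
# Frequency is 2 when x1 = x2 = 1, otherwise 1
d$claims <- rpois(n, lambda = 2^(d$x1 == 1 & d$x2 == 1) * d$exposure)

# Fit 1: counts as response, log(exposure) as offset
fit_offset <- glm(claims ~ x1 + x2 + offset(log(exposure)),
                  family = poisson(link = "log"), data = d)

# Fit 2: frequency as response, exposure as weights
fit_weight <- suppressWarnings(
  glm(I(claims / exposure) ~ x1 + x2,
      family = poisson(link = "log"), weights = exposure, data = d)
)

# Compare the coefficient estimates side by side
print(cbind(offset = coef(fit_offset), weight = coef(fit_weight)))
```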

I'm not exactly sure what objective='count:poisson' corresponds to, but I would expect setting your target variable as frequency (count/exposure) and using exposure as the weight in xgboost would be the way to go when exposures are varying.
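
One way this suggestion might look in xgboost is sketched below, on data simulated as in the question. Treat it as a sketch rather than a verified recipe: the choices eta = 0.3 and nrounds = 50 and the name freq_hat are assumptions made for illustration. As far as I can tell, the count:poisson gradient for a row with label y, weight w and raw score f is w * (exp(f) - y), so with y = claims/exposure and w = exposure it becomes exposure * exp(f) - claims, which matches what a log(exposure) offset would give.

```r
library(xgboost)

# Simulated data, as in the question
set.seed(1)
n <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), n, replace = TRUE),
  x2 = sample(c(0, 1), n, replace = TRUE),
  exposure = runif(n, 1, 10) * 0.3
)
d$claims <- rpois(n, lambda = 2^(d$x1 == 1 & d$x2 == 1) * d$exposure)

X <- as.matrix(d[, c("x1", "x2")])

# Label is the observed frequency, weight is the exposure
dtrain <- xgb.DMatrix(X, label = d$claims / d$exposure, weight = d$exposure)

fit <- xgb.train(
  params = list(objective = "count:poisson", eta = 0.3, max_depth = 2),
  data = dtrain,
  nrounds = 50
)

# Predictions are on the frequency scale (claims per unit exposure)
d$freq_hat <- predict(fit, xgb.DMatrix(X))
print(aggregate(freq_hat ~ x1 + x2, data = d, FUN = mean))
```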

Vinh Nguyen answered Oct 05 '22 19:10


I have now worked out how to do this: use setinfo to set the base_margin attribute to the offset (on the linear-predictor scale), i.e.:

setinfo(xgtrain, "base_margin", log(d$exposure))
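
For context, a fuller sketch of how that call might fit into training and prediction, on data simulated as in the question. The settings eta = 0.3 and nrounds = 50 and the names claims_hat and freq_hat are assumptions for illustration. The point to watch is that predict uses whatever base_margin is attached to the DMatrix it is given, so predictions made with the offset attached are expected claim counts, not frequencies.

```r
library(xgboost)

# Simulated data, as in the question
set.seed(1)
n <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), n, replace = TRUE),
  x2 = sample(c(0, 1), n, replace = TRUE),
  exposure = runif(n, 1, 10) * 0.3
)
d$claims <- rpois(n, lambda = 2^(d$x1 == 1 & d$x2 == 1) * d$exposure)

# Label is the raw claim count; the offset goes in base_margin
xgtrain <- xgb.DMatrix(as.matrix(d[, c("x1", "x2")]), label = d$claims)
setinfo(xgtrain, "base_margin", log(d$exposure))

fit <- xgb.train(
  params = list(objective = "count:poisson", eta = 0.3, max_depth = 2),
  data = xgtrain,
  nrounds = 50
)

# With the offset attached, predictions are expected claims per policy
d$claims_hat <- predict(fit, xgtrain)

# Dividing by exposure recovers the frequency per unit exposure
d$freq_hat <- d$claims_hat / d$exposure
print(aggregate(freq_hat ~ x1 + x2, data = d, FUN = mean))
```

An alternative to dividing by exposure would be to predict on a DMatrix whose base_margin is all zeros, which corresponds to one unit of exposure.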

Pete Lowth answered Oct 05 '22 19:10