
Xgboost cox survival time entry

In the new implementation of the Cox PH survival model in XGBoost 0.81, how does one specify the start and end time of an event?

Thanks

The R equivalent function would be for example :

cph_mod = coxph(Surv(Start, Stop, Status) ~ Age + Sex + SBP, data=data)
asked Jan 01 '23 by aiedu

1 Answer

XGBoost does not allow for a start time (i.e., delayed entry). If it makes sense for the application, you can always change the underlying time scale so that all subjects start at time = 0. However, XGBoost does allow for right-censored data. Documentation and examples for how to implement a Cox model are hard to find, but the source code states: "Cox regression for censored survival data (negative labels are considered censored)."
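As a minimal sketch of that workaround (the variable names and toy numbers here are illustrative, not from the question's data): given (start, stop, status) triples, you can switch to a duration time scale with stop - start, then encode censoring with a negative sign as XGBoost's survival:cox objective expects. Note that collapsing delayed entry to durations changes the model's risk sets, so this is only sensible when the application justifies it.

```python
# Toy (start, stop, status) data; status 1 = event observed, 0 = censored.
starts = [0.0, 2.0, 5.0]
stops = [10.0, 7.0, 12.0]
status = [1, 0, 1]

# Shift every subject to start at time = 0 by using durations.
durations = [stop - start for start, stop in zip(starts, stops)]

# XGBoost's survival:cox takes a single label column:
# positive duration = event, negative duration = censored.
labels = [d if s == 1 else -d for d, s in zip(durations, status)]
print(labels)  # [10.0, -5.0, 7.0]
```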

Here is a short example for anyone who wants to try XGBoost with obj="survival:cox". We can compare the results to the scikit-survival package sksurv. To make XGBoost more similar to that framework, we use a linear booster instead of a tree booster.

import pandas as pd
import xgboost as xgb
from sksurv.datasets import load_aids
from sksurv.linear_model import CoxPHSurvivalAnalysis

# load and inspect the data
data_x, data_y = load_aids()
data_y[10:15]
Out[586]: 
array([(False, 334.), (False, 285.), (False, 265.), ( True, 206.),
   (False, 305.)], dtype=[('censor', '?'), ('time', '<f8')])

# Since XGBoost only allows one column for y, the censoring information
# is coded as negative values:
data_y_xgb = [x[1] if x[0] else -x[1] for x in data_y]
data_y_xgb[10:15]
Out[3]: [-334.0, -285.0, -265.0, 206.0, -305.0]

data_x = data_x[['age', 'cd4']]
data_x.head()
Out[4]: 
    age    cd4
0  34.0  169.0
1  34.0  149.5
2  20.0   23.5
3  48.0   46.0
4  46.0   10.0

# Since sksurv outputs log hazard ratios (here relative to 0 on the predictors)
# we must use 'output_margin=True' for comparability.
estimator = CoxPHSurvivalAnalysis().fit(data_x, data_y)
gbm = xgb.XGBRegressor(objective='survival:cox',
                       booster='gblinear',
                       base_score=1,
                       n_estimators=1000).fit(data_x, data_y_xgb)
prediction_sksurv = estimator.predict(data_x)
predictions_xgb = gbm.predict(data_x, output_margin=True)
d = pd.DataFrame({'xgb': predictions_xgb,
                  'sksurv': prediction_sksurv})
d.head()
Out[13]: 
     sksurv       xgb
0 -1.892490 -1.843828
1 -1.569389 -1.524385
2  0.144572  0.207866
3  0.519293  0.502953
4  1.062392  1.045287

d.plot.scatter('xgb', 'sksurv')

[Scatter plot of the xgb predictions against the sksurv predictions]

Note that these are predictions on the same data that was used to fit the model. It seems that XGBoost gets the values right, but sometimes only up to a linear transformation; I do not know why. Play around with base_score and n_estimators. Perhaps someone can add to this answer.
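One way to check the "linear transformation" observation is to regress one set of predictions on the other and look at the correlation. The numbers below are synthetic stand-ins for the two prediction arrays (not the real model output), just to show the diagnostic:

```python
import numpy as np

# Synthetic stand-ins: pretend xgb output is a linear shift of sksurv output.
sksurv_pred = np.array([-1.89, -1.57, 0.14, 0.52, 1.06])
xgb_pred = 0.97 * sksurv_pred + 0.03

# If the two differ only by a linear transformation, a least-squares fit
# recovers slope/intercept and the correlation is (near) perfect.
slope, intercept = np.polyfit(xgb_pred, sksurv_pred, 1)
r = np.corrcoef(xgb_pred, sksurv_pred)[0, 1]
print(round(slope, 2), round(intercept, 2), round(r, 4))  # 1.03 -0.03 1.0
```

In practice you would pass the real predictions_xgb and prediction_sksurv arrays; a correlation near 1 with a non-trivial slope or intercept would confirm the linear offset.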

answered Jan 10 '23 by PeterStrom