I have a survfit
object. A summary survfit for my t=0:50
years of interest is easy enough.
summary(survfit, t=0:50)
It gives the survival at each t.
Is there a way to get the hazard for each t (in this case, the hazard from t-1 to t in each t=0:50)? I want to get the mean and confidence interval (or standard error) for the hazards relating to the Kaplan Meier curve.
This seems easy to do when a distribution is fit (eg. type="hazard"
in flexsurvreg
) but I can't figure out how to do this for a regular survfit object. Suggestions?
The hazard function (also called the force of mortality, instantaneous failure rate, instantaneous death rate, or age-specific failure rate) is a way to model data distribution in survival analysis. The most common use of the function is to model a participant's chance of death as a function of their age.
This cumulative hazard function can be thought of as providing the total accumulated risk of experiencing the event of interest that has been gained by progressing to time t. While the instantaneous hazard rate (h(t)) can increase or decrease with time, the cumulative hazard rate can only increase or remain the same.
The R package named survival is used to carry out survival analysis. This package contains the function Surv() which takes the input data as a R formula and creates a survival object among the chosen variables for analysis. Then we use the function survfit() to create a plot for the analysis.
It is a bit tricky since the hazard is an estimate of an instantaneous probability (and this is discrete data), but the basehaz
function might be of some help, but it only returns the cumulative hazard. So you would have still have to perform an extra step.
I have also had luck with the muhaz
function. From its documentation:
library(muhaz)
?muhaz
data(ovarian, package="survival")
attach(ovarian)
fit1 <- muhaz(futime, fustat)
plot(fit1)
I am not sure the best way to get at the 95% confidence interval, but bootstrapping might be one approach.
#Function to bootstrap hazard estimates
haz.bootstrap <- function(data,trial,min.time,max.time){
library(data.table)
data <- as.data.table(data)
data <- data[sample(1:nrow(data),nrow(data),replace=T)]
fit1 <- muhaz(data$futime, data$fustat,min.time=min.time,max.time=max.time)
result <- data.table(est.grid=fit1$est.grid,trial,haz.est=fit1$haz.est)
return(result)
}
#Re-run function to get 1000 estimates
haz.list <- lapply(1:1000,function(x) haz.bootstrap(data=ovarian,trial=x,min.time=0,max.time=744))
haz.table <- rbindlist(haz.list,fill=T)
#Calculate Mean,SD,upper and lower 95% confidence bands
plot.table <- haz.table[, .(Mean=mean(haz.est),SD=sd(haz.est)), by=est.grid]
plot.table[, u95 := Mean+1.96*SD]
plot.table[, l95 := Mean-1.96*SD]
#Plot graph
library(ggplot2)
p <- ggplot(data=plot.table)+geom_smooth(aes(x=est.grid,y=Mean))
p <- p+geom_smooth(aes(x=est.grid,y=u95),linetype="dashed")
p <- p+geom_smooth(aes(x=est.grid,y=l95),linetype="dashed")
p
As a supplement to Mike's answer, one could model the number of events by a Poisson distribution instead of a Normal distribution. The hazard rate can then be calculated via a gamma distribution. The code would become:
library(muhaz)
library(data.table)
library(rGammaGamma)
data(ovarian, package="survival")
attach(ovarian)
fit1 <- muhaz(futime, fustat)
plot(fit1)
#Function to bootstrap hazard estimates
haz.bootstrap <- function(data,trial,min.time,max.time){
library(data.table)
data <- as.data.table(data)
data <- data[sample(1:nrow(data),nrow(data),replace=T)]
fit1 <- muhaz(data$futime, data$fustat,min.time=min.time,max.time=max.time)
result <- data.table(est.grid=fit1$est.grid,trial,haz.est=fit1$haz.est)
return(result)
}
#Re-run function to get 1000 estimates
haz.list <- lapply(1:1000,function(x) haz.bootstrap(data=ovarian,trial=x,min.time=0,max.time=744))
haz.table <- rbindlist(haz.list,fill=T)
#Calculate Mean, gamma parameters, upper and lower 95% confidence bands
plot.table <- haz.table[, .(Mean=mean(haz.est),
Shape = gammaMME(haz.est)["shape"],
Scale = gammaMME(haz.est)["scale"]), by=est.grid]
plot.table[, u95 := qgamma(0.95,shape = Shape + 1, scale = Scale)]
# The + 1 is due to the discrete character of the poisson distribution.
plot.table[, l95 := qgamma(0.05,shape = Shape, scale = Scale)]
#Plot graph
ggplot(data=plot.table) +
geom_line(aes(x=est.grid, y=Mean),col="blue") +
geom_ribbon(aes(x=est.grid, y=Mean, ymin=l95, ymax=u95),alpha=0.5, fill= "lightblue")
As can be seen the negative estimates for the lower bound of the hazard rate are now gone.
For sake of performance, we could use a more streamlined bootstrap function.
## define custom times
t0 <- 0
t1 <- 744
## bootstrap fun
boot_fun <- function(x) {
n <- dim(x)[1]
x <- x[sample.int(n, n, replace=TRUE), ]
muhaz::muhaz(x$futime, x$fustat, min.time=t0, max.time=t1)
}
# bootstrap
set.seed(42)
R <- 999
B <- replicate(R, boot_fun(ovarian))
The results may be calculated by hand.
## extract matrix from bootstrap
r <- `colnames<-`(t(array(unlist(B[3, ]), dim=c(101, R))), B[2, ][[1]])
## calculate result
library(matrixStats) ## for fast matrix calculations
r <- cbind(x=as.numeric(colnames(r)),
y=colMeans2(r),
shape=(colMeans2(r)/colSds(r))^2,
scale=colVars(r)/colMeans2(r))
r <- cbind(r[, 1:2],
lower=qgamma(0.025, shape=r[, 'shape'] + 1, scale=r[, 'scale']),
upper=qgamma(0.975, shape=r[, 'shape'], scale=r[, 'scale']))
head(r)
# x y lower upper
# [1,] 0.00 0.0003836816 9.359267e-05 0.001400539
# [2,] 7.44 0.0003913992 9.746868e-05 0.001387551
# [3,] 14.88 0.0003997275 1.018897e-04 0.001374005
# [4,] 22.32 0.0004087439 1.069353e-04 0.001360212
# [5,] 29.76 0.0004178464 1.123697e-04 0.001346187
# [6,] 37.20 0.0004275531 1.184685e-04 0.001332237
range(r[, 'y'])
# [1] 0.0003836816 0.0011122910
matplot(r[, 1], r[, -1], type='l', lty=c(1, 2, 2), col=4,
xlab='Time', ylab='Hazard Rate', main='Hazard Estimates')
legend('topleft', legend=c('estimate', '95% CI'), col=4, lty=1:2, cex=.8)
Data
data(cancer, package="survival") ## loads `ovarian` data set
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With