This note describes a few elementary aspects of practical analysis of
survival data in R. For further information we refer to the book Introductory Statistics with R
by Peter Dalgaard and Modeling Survival Data by Terry M. Therneau (the author of the
survival package for R) and Patricia M. Grambsch.
A more classical and general reference is Statistical Models Based on Counting Processes
by Andersen, Borgan, Gill and Keiding.
This note supplements the methods covered in Harrell's book.
The analyses that we will do can all be done using the functions in the R package
survival. That package also contains some example data sets.
library(survival)
## Loading required package: splines
data(pbc)
pbc[1:10, c("time", "status", "age", "edema", "albumin", "bili", "protime")]
## time status age edema albumin bili protime
## 1 400 2 58.77 1.0 2.60 14.5 12.2
## 2 4500 0 56.45 0.0 4.14 1.1 10.6
## 3 1012 2 70.07 0.5 3.48 1.4 12.0
## 4 1925 2 54.74 0.5 2.54 1.8 10.3
## 5 1504 1 38.11 0.0 3.53 3.4 10.9
## 6 2503 2 66.26 0.0 3.98 0.8 11.0
## 7 1832 0 55.53 0.0 4.09 1.0 9.7
## 8 2466 2 53.06 0.0 4.00 0.3 11.0
## 9 2400 2 42.51 0.0 3.08 3.2 11.0
## 10 51 2 70.56 1.0 2.74 12.6 11.5
The data set pbc (primary biliary cirrhosis) contains the main variable time, which is the
observed, potentially censored survival time, and the variable status, which indicates
whether the observation has been censored. In addition there are 17 observed covariates;
above we show five that have been found to be important in previous analyses.
We can produce a plot of the survival times as follows:
nr <- 50
plot(c(0, pbc$time[1]), c(1, 1), type = "l", ylim = c(0, nr + 1),
     xlim = c(0, max(pbc$time[1:nr]) + 10), ylab = "nr", xlab = "Survival time")
for (i in 2:nr) lines(c(0, pbc$time[i]), c(i, i))
for (i in 1:nr) {
    if (pbc$status[i] == 0)
        points(pbc$time[i], i, col = "red", pch = 20)  # censored
    if (pbc$status[i] == 2)
        points(pbc$time[i], i)                         # death
}
In the complete data set there are 418 patients; the status variable shows that 161 have died, 232 have been censored and 25 have received a transplant. We exclude those 25 observations from the further analysis. Furthermore, we change the status variable to a logical (TRUE for death) for subsequent use.
pbc <- subset(pbc, status != 1)
pbc <- transform(pbc, status = as.logical(status))
Estimation of the survival function using the Kaplan-Meier estimator can be done using
the survfit function.
pbcSurv <- survfit(Surv(time, status) ~ 1, data = pbc)
pbcSurv
## Call: survfit(formula = Surv(time, status) ~ 1, data = pbc)
##
## records n.max n.start events median 0.95LCL 0.95UCL
## 393 393 393 161 3358 2847 3839
plot(pbcSurv, mark.time = FALSE)
The result from survfit is an R object of class survfit. Calling summary with a survfit
object as argument provides the non-parametric estimate of the survival function,
including additional information, as a long list. The plot with the added pointwise
confidence bands is perhaps more informative. Setting mark.time = FALSE prevents the
censored observations from being marked on the plot by a “+”.
The function Surv applied to the time and status variables for the PBC data creates a
survival object.
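The long list produced by summary can be restricted to selected time points with its times argument; a sketch (the time points below are chosen purely for illustration):

```r
library(survival)
data(pbc)
pbc <- subset(pbc, status != 1)                     # exclude transplanted patients, as above
pbc <- transform(pbc, status = as.logical(status))  # TRUE = death
fit <- survfit(Surv(time, status) ~ 1, data = pbc)
## Kaplan-Meier estimates with pointwise confidence limits at three chosen times
summary(fit, times = c(1000, 2000, 3000))
```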
We can compute the Kaplan-Meier and the Nelson-Aalen estimators directly. First we form the at-risk process, the number of individuals still at risk just before each time point, for the subsequent computations.
risk <- data.frame(Y = nrow(pbc):1, time = sort(pbc$time))  ## number at risk just before each ordered time
surv <- risk[pbc$status[order(pbc$time)], ]                 ## keep the rows corresponding to deaths
plot(Y ~ time, data = risk, type = "l")
points(Y ~ time, data = surv, col = "red") ## uncensored survivals
Then we compute the two non-parametric estimators of the survival function.
### Kaplan-Meier and Nelson-Aalen estimates
surv <- transform(surv, KM = cumprod(1 - 1/Y), NAa = exp(-cumsum(1/Y)))
plot(KM ~ time, data = surv, type = "s", ylim = c(0, 1), xlim = c(0, 4200))
lines(NAa ~ time, data = surv, type = "s", col = "blue")
It is difficult to see any difference. We can also compute and plot the estimated cumulative hazard functions, where it is possible to observe small differences in the tail.
plot(-log(KM) ~ time, data = surv, type = "s", xlim = c(0, 4200))
lines(-log(NAa) ~ time, data = surv, type = "s", col = "blue")
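The two recursions are easy to check by hand on a small constructed data set (the toy data below are made up for illustration): with survival times 1 to 5 and events at times 1, 2 and 4, the numbers at risk at the event times are 5, 4 and 2.

```r
toy <- data.frame(time = 1:5, status = c(TRUE, TRUE, FALSE, TRUE, FALSE))
risk <- data.frame(Y = nrow(toy):1, time = sort(toy$time))
surv <- risk[toy$status[order(toy$time)], ]  # event times only; Y = 5, 4, 2
surv <- transform(surv, KM = cumprod(1 - 1/Y), NAa = exp(-cumsum(1/Y)))
surv$KM   # 0.8, 0.6, 0.3
surv$NAa  # exp(-1/5), exp(-1/5 - 1/4), exp(-1/5 - 1/4 - 1/2)
```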
A major point in a data set like the PBC data above is that we have covariate
information on the individuals. It is no problem to split the estimation of the
survival function according to a factor such as the edema variable in the PBC data.
pbcSurv <- survfit(Surv(time, status) ~ edema, data = pbc)
plot(pbcSurv, mark.time = FALSE, conf.int = TRUE, col = c("red", "blue", "purple"))
From the plot it seems clear that individuals without edema (edema = 0) have a larger survival time than those with edema, and those with untreated or successfully treated edema (edema = 0.5) have a larger survival time than those with edema despite treatment (edema = 1). This can also be tested formally with the so-called log-rank test.
survdiff(Surv(time, status) ~ edema, data = pbc)
## Call:
## survdiff(formula = Surv(time, status) ~ edema, data = pbc)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## edema=0 332 116 145.38 5.94 61.6
## edema=0.5 41 26 13.00 12.99 14.2
## edema=1 20 19 2.61 102.79 105.4
##
## Chisq= 123 on 2 degrees of freedom, p= 0
But what about also including the other covariates, especially the continuous ones? This can be handled in the framework of the Cox proportional hazards model.
pbcSurv <- coxph(Surv(time, status) ~ as.factor(edema), data = pbc)
summary(pbcSurv)
## Call:
## coxph(formula = Surv(time, status) ~ as.factor(edema), data = pbc)
##
## n= 393, number of events= 161
##
## coef exp(coef) se(coef) z Pr(>|z|)
## as.factor(edema)0.5 0.933 2.543 0.217 4.29 1.8e-05 ***
## as.factor(edema)1 2.277 9.747 0.254 8.95 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## as.factor(edema)0.5 2.54 0.393 1.66 3.89
## as.factor(edema)1 9.75 0.103 5.92 16.05
##
## Concordance= 0.625 (se = 0.013 )
## Rsquare= 0.142 (max possible= 0.988 )
## Likelihood ratio test= 60.4 on 2 df, p=7.84e-14
## Wald test = 88.1 on 2 df, p=0
## Score (logrank) test = 123 on 2 df, p=0
We can tell from the summary that the \( \beta \)'s corresponding to the
covariate edema are significantly different from 0. We also see that the
hazard rate for those with edema=1 is estimated as roughly 10 times the
baseline hazard rate \( \lambda_0(t) \).
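The estimated hazard ratios and their confidence intervals can also be extracted programmatically from the fitted model with coef and confint; a sketch, repeating the recoding from above so the fragment is self-contained:

```r
library(survival)
data(pbc)
pbc <- subset(pbc, status != 1)
pbc <- transform(pbc, status = as.logical(status))
fit <- coxph(Surv(time, status) ~ as.factor(edema), data = pbc)
exp(coef(fit))     # hazard ratios relative to the edema = 0 baseline
exp(confint(fit))  # 95% confidence intervals for the hazard ratios
```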
We can include more covariates.
pbcSurv <- coxph(Surv(time, status) ~ as.factor(edema) + age + albumin + bili +
protime, data = pbc)
summary(pbcSurv)
## Call:
## coxph(formula = Surv(time, status) ~ as.factor(edema) + age +
## albumin + bili + protime, data = pbc)
##
## n= 391, number of events= 160
## (2 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## as.factor(edema)0.5 0.16133 1.17507 0.23181 0.70 0.48645
## as.factor(edema)1 1.04124 2.83274 0.29616 3.52 0.00044 ***
## age 0.03697 1.03766 0.00819 4.52 6.3e-06 ***
## albumin -1.00671 0.36542 0.20575 -4.89 9.9e-07 ***
## bili 0.12139 1.12907 0.01348 9.00 < 2e-16 ***
## protime 0.17767 1.19443 0.05827 3.05 0.00230 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## as.factor(edema)0.5 1.175 0.851 0.746 1.851
## as.factor(edema)1 2.833 0.353 1.585 5.062
## age 1.038 0.964 1.021 1.054
## albumin 0.365 2.737 0.244 0.547
## bili 1.129 0.886 1.100 1.159
## protime 1.194 0.837 1.066 1.339
##
## Concordance= 0.81 (se = 0.025 )
## Rsquare= 0.371 (max possible= 0.988 )
## Likelihood ratio test= 181 on 6 df, p=0
## Wald test = 216 on 6 df, p=0
## Score (logrank) test = 327 on 6 df, p=0
It is of course also possible to transform the covariates before using them in the regression. For example,
pbcSurv <- coxph(Surv(time, status) ~ as.factor(edema) + age + log(bili) + log(albumin) +
log(protime), data = pbc)
summary(pbcSurv)
## Call:
## coxph(formula = Surv(time, status) ~ as.factor(edema) + age +
## log(bili) + log(albumin) + log(protime), data = pbc)
##
## n= 391, number of events= 160
## (2 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## as.factor(edema)0.5 0.25301 1.28789 0.22599 1.12 0.2629
## as.factor(edema)1 0.94466 2.57194 0.28954 3.26 0.0011 **
## age 0.03680 1.03749 0.00773 4.76 1.9e-06 ***
## log(bili) 0.88805 2.43039 0.08471 10.48 < 2e-16 ***
## log(albumin) -2.61669 0.07304 0.65204 -4.01 6.0e-05 ***
## log(protime) 2.01116 7.47196 0.79429 2.53 0.0113 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## as.factor(edema)0.5 1.288 0.776 0.8270 2.006
## as.factor(edema)1 2.572 0.389 1.4582 4.536
## age 1.037 0.964 1.0219 1.053
## log(bili) 2.430 0.411 2.0586 2.869
## log(albumin) 0.073 13.690 0.0204 0.262
## log(protime) 7.472 0.134 1.5752 35.444
##
## Concordance= 0.834 (se = 0.025 )
## Rsquare= 0.445 (max possible= 0.988 )
## Likelihood ratio test= 230 on 6 df, p=0
## Wald test = 234 on 6 df, p=0
## Score (logrank) test = 320 on 6 df, p=0
One also needs tools for model diagnostics. One simple idea is as follows. The survival functions \( S_i(t) \) for different individuals \( i \) must satisfy \[ \log(-\log(S_i(t))) = \log \int_0^t \lambda_0(s) \, \mathrm{d} s + X_i^T\beta, \] so, as functions of \( t \), they are parallel: they differ only by the constant \( X_i^T\beta \). We do not know each individual survival function, but if we group individuals with similar covariates we may be able to estimate the survival functions for the groups non-parametrically and then inspect visually whether they look parallel on the \( \log(-\log) \) scale.
Considering only one variable, bili, we divide the observations into three groups.
group <- 1 * (pbc$bili > 1.1) + 1 * (pbc$bili > 3.3)
cfit <- survfit(Surv(time, status) ~ group, data = pbc)
plot(cfit, mark.time = FALSE, fun = "cloglog", xlim = c(40, 5000))
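The same grouping can be written more transparently with cut; the break points 1.1 and 3.3 are those used above (the setup lines are repeated so the fragment stands alone):

```r
library(survival)
data(pbc)
pbc <- subset(pbc, status != 1)
group  <- 1 * (pbc$bili > 1.1) + 1 * (pbc$bili > 3.3)           # as above
group2 <- cut(pbc$bili, breaks = c(-Inf, 1.1, 3.3, Inf),
              labels = FALSE) - 1                                # integer codes 0, 1, 2
table(group, group2)  # the two codings agree
```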
Including in addition edema, which is already discrete, we get a total of 9 groups.
Note the additive specification in the formula below, which in reality stratifies
the estimation into all 9 possible combinations.
cfit <- survfit(Surv(time, status) ~ group + edema, data = pbc)
plot(cfit, mark.time = FALSE, fun = "cloglog", xlim = c(40, 5000), col = rainbow(9))
It is left to the reader to judge the plot …
A good question is what happens if further stratification according to other variables is applied. The above method can be found on pages 43-44 in Therneau and Grambsch. It may not be particularly sensible if there are other significant covariates besides those used for the stratification.
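Therneau and Grambsch also develop a formal check of the proportional hazards assumption based on scaled Schoenfeld residuals, implemented as cox.zph in the survival package. A sketch, with an illustrative choice of covariates:

```r
library(survival)
data(pbc)
pbc <- subset(pbc, status != 1)
pbc <- transform(pbc, status = as.logical(status))
fit <- coxph(Surv(time, status) ~ as.factor(edema) + age + log(bili), data = pbc)
zp <- cox.zph(fit)  # score tests for a time-varying coefficient, per term and globally
print(zp)
plot(zp)            # smoothed scaled Schoenfeld residuals against time
```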
If we look at the formula above, the “correct” way to proceed is to stratify according to one covariate, estimate the Cox model in each stratum based on the other covariates, and then compare the non-parametric estimates of the logarithm of the integrated baseline intensities, that is, the first term in the formula for \( \log(-\log(S_i(t))) \). This is described in detail in Andersen, Borgan, Gill and Keiding, VII.3, pages 539-545.
The procedure above requires that one can estimate the integrated baseline intensity non-parametrically, and here we cannot fall back on the Kaplan-Meier estimator. It is, however, not a problem to construct such an estimator (the Breslow estimator), but we will not pursue the details here; see Andersen et al.
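In practice, the survival package already provides such an estimator: with strata in a coxph formula, basehaz returns a Breslow-type estimate of the cumulative baseline hazard per stratum, whose logarithms can then be compared visually. A sketch, with an illustrative choice of covariates:

```r
library(survival)
data(pbc)
pbc <- subset(pbc, status != 1)
pbc <- transform(pbc, status = as.logical(status))
## common beta for age and log(bili), separate baseline hazard in each edema stratum
sfit <- coxph(Surv(time, status) ~ age + log(bili) + strata(edema), data = pbc)
bh <- basehaz(sfit, centered = FALSE)  # cumulative baseline hazard per stratum
with(bh, plot(time, log(hazard), col = as.integer(strata), pch = 20,
              xlab = "time", ylab = "log cumulative baseline hazard"))
```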