This note describes a few elementary aspects of practical analysis of
survival data in R. For further information we refer to the book Introductory Statistics with R
by Peter Dalgaard and Modeling Survival Data by Terry M. Therneau (the author of the
survival package for R) and Patricia M. Grambsch.
A more classical and general reference is Statistical Models Based on Counting Processes
by Andersen, Borgan, Gill and Keiding.
This note supplements the methods covered in Harrell's book.
The analyses that we will do can all be done using the functions in the R package
survival. That package also contains some example data sets.
library(survival)
## Loading required package: splines
data(pbc)
pbc[1:10, c("time", "status", "age", "edema", "albumin", "bili", "protime")]
## time status age edema albumin bili protime
## 1 400 2 58.77 1.0 2.60 14.5 12.2
## 2 4500 0 56.45 0.0 4.14 1.1 10.6
## 3 1012 2 70.07 0.5 3.48 1.4 12.0
## 4 1925 2 54.74 0.5 2.54 1.8 10.3
## 5 1504 1 38.11 0.0 3.53 3.4 10.9
## 6 2503 2 66.26 0.0 3.98 0.8 11.0
## 7 1832 0 55.53 0.0 4.09 1.0 9.7
## 8 2466 2 53.06 0.0 4.00 0.3 11.0
## 9 2400 2 42.51 0.0 3.08 3.2 11.0
## 10 51 2 70.56 1.0 2.74 12.6 11.5
The data set pbc (primary biliary cirrhosis) contains the main variable time, which is the
observed, potentially censored survival time, and the variable status, which indicates
whether the observation has been censored. In addition there are 17 observed covariates;
above we show five that have been found to be important in previous analyses.
We can produce a plot of the survival times as follows:
nr <- 50
plot(c(0, pbc$time[1]), c(1, 1), type = "l", ylim = c(0, nr + 1),
     xlim = c(0, max(pbc$time[1:nr]) + 10), ylab = "nr", xlab = "Survival time")
for (i in 2:nr) lines(c(0, pbc$time[i]), c(i, i))
for (i in 1:nr) {
    if (pbc$status[i] == 0)
        points(pbc$time[i], i, col = "red", pch = 20)  # censored
    if (pbc$status[i] == 2)
        points(pbc$time[i], i)                         # death
}
In the complete data set there are 418 patients; the status variable shows that 161 have died, 232 have been censored and 25 have received a transplant. We exclude those 25 observations from the further analysis. Furthermore, we change the status variable to a logical (TRUE for death) for subsequent use.
pbc <- subset(pbc, status != 1)
pbc <- transform(pbc, status = as.logical(status))
Estimation of the survival function using the Kaplan-Meier estimator can be done using
the survfit function.
pbcSurv <- survfit(Surv(time, status) ~ 1, data = pbc)
pbcSurv
## Call: survfit(formula = Surv(time, status) ~ 1, data = pbc)
##
## records n.max n.start events median 0.95LCL 0.95UCL
## 393 393 393 161 3358 2847 3839
plot(pbcSurv, mark.time = FALSE)
The result from survfit is an R object of class survfit. Calling summary with a survfit
object as argument provides the non-parametric estimate of the survival function,
including additional information, as a long list. The plot with the added pointwise
confidence bands is perhaps more informative. Setting mark.time = FALSE prevents the
censored observations from being marked on the plot by a “+”.
The function Surv applied to the time and status variables for the PBC data creates a
survival object.
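The long list produced by summary can be restricted to selected time points with its times argument; a sketch (the time points below are chosen purely for illustration):

```r
library(survival)
data(pbc)
pbc <- subset(pbc, status != 1)                     # exclude transplanted patients, as above
pbc <- transform(pbc, status = as.logical(status))  # TRUE = death
fit <- survfit(Surv(time, status) ~ 1, data = pbc)
## Kaplan-Meier estimates with pointwise confidence limits at three chosen times
summary(fit, times = c(1000, 2000, 3000))
```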
We can compute the Kaplan-Meier and the Nelson-Aalen estimators directly. First we form the at-risk process, the number of individuals still at risk just before each time point, for the subsequent computations.
risk <- data.frame(Y = nrow(pbc):1, time = sort(pbc$time))  ## number at risk just before each ordered time
surv <- risk[pbc$status[order(pbc$time)], ]                 ## keep the rows corresponding to deaths
plot(Y ~ time, data = risk, type = "l")
points(Y ~ time, data = surv, col = "red") ## uncensored survivals
Then we compute the two non-parametric estimators of the survival function.
### Kaplan-Meier and Nelson-Aalen estimates
surv <- transform(surv, KM = cumprod(1 - 1/Y), NAa = exp(-cumsum(1/Y)))
plot(KM ~ time, data = surv, type = "s", ylim = c(0, 1), xlim = c(0, 4200))
lines(NAa ~ time, data = surv, type = "s", col = "blue")
It is difficult to see any difference. We can also compute and plot the estimated cumulative hazard functions, where it is possible to observe small differences in the tail.
plot(-log(KM) ~ time, data = surv, type = "s", xlim = c(0, 4200))
lines(-log(NAa) ~ time, data = surv, type = "s", col = "blue")
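The two recursions are easy to check by hand on a small constructed data set (the toy data below are made up for illustration): with survival times 1 to 5 and events at times 1, 2 and 4, the numbers at risk at the event times are 5, 4 and 2.

```r
toy <- data.frame(time = 1:5, status = c(TRUE, TRUE, FALSE, TRUE, FALSE))
risk <- data.frame(Y = nrow(toy):1, time = sort(toy$time))
surv <- risk[toy$status[order(toy$time)], ]  # event times only; Y = 5, 4, 2
surv <- transform(surv, KM = cumprod(1 - 1/Y), NAa = exp(-cumsum(1/Y)))
surv$KM   # 0.8, 0.6, 0.3
surv$NAa  # exp(-1/5), exp(-1/5 - 1/4), exp(-1/5 - 1/4 - 1/2)
```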
A major point in a data set like the PBC data above is that we have covariate
information on the individuals. It is no problem to split the estimation of the
survival function according to a factor such as the edema variable in the PBC data.
pbcSurv <- survfit(Surv(time, status) ~ edema, data = pbc)
plot(pbcSurv, mark.time = FALSE, conf.int = TRUE, col = c("red", "blue", "purple"))
From the plot it seems clear that individuals without edema (edema = 0) have a larger survival time than those with edema, and those with untreated or successfully treated edema (edema = 0.5) have a larger survival time than those with edema despite treatment (edema = 1). This can also be tested formally with the so-called log-rank test.
survdiff(Surv(time, status) ~ edema, data = pbc)
## Call:
## survdiff(formula = Surv(time, status) ~ edema, data = pbc)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## edema=0 332 116 145.38 5.94 61.6
## edema=0.5 41 26 13.00 12.99 14.2
## edema=1 20 19 2.61 102.79 105.4
##
## Chisq= 123 on 2 degrees of freedom, p= 0
But what about also including the other covariates, especially the continuous ones? This can be handled in the framework of the Cox proportional hazards model.
pbcSurv <- coxph(Surv(time, status) ~ as.factor(edema), data = pbc)
summary(pbcSurv)
## Call:
## coxph(formula = Surv(time, status) ~ as.factor(edema), data = pbc)
##
## n= 393, number of events= 161
##
## coef exp(coef) se(coef) z Pr(>|z|)
## as.factor(edema)0.5 0.933 2.543 0.217 4.29 1.8e-05 ***
## as.factor(edema)1 2.277 9.747 0.254 8.95 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## as.factor(edema)0.5 2.54 0.393 1.66 3.89
## as.factor(edema)1 9.75 0.103 5.92 16.05
##
## Concordance= 0.625 (se = 0.013 )
## Rsquare= 0.142 (max possible= 0.988 )
## Likelihood ratio test= 60.4 on 2 df, p=7.84e-14
## Wald test = 88.1 on 2 df, p=0
## Score (logrank) test = 123 on 2 df, p=0
We can tell from the summary that the \( \beta \)'s corresponding to the
covariate edema are significantly different from 0. We also see that the
hazard rate for those with edema=1 is estimated as roughly 10 times the
baseline hazard rate \( \lambda_0(t) \).
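The estimated hazard ratios and their confidence intervals can also be extracted programmatically from the fitted model with coef and confint; a sketch, repeating the recoding from above so the fragment is self-contained:

```r
library(survival)
data(pbc)
pbc <- subset(pbc, status != 1)
pbc <- transform(pbc, status = as.logical(status))
fit <- coxph(Surv(time, status) ~ as.factor(edema), data = pbc)
exp(coef(fit))     # hazard ratios relative to the edema = 0 baseline
exp(confint(fit))  # 95% confidence intervals for the hazard ratios
```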
We can include more covariates.
pbcSurv <- coxph(Surv(time, status) ~ as.factor(edema) + age + albumin + bili +
protime, data = pbc)
summary(pbcSurv)
## Call:
## coxph(formula = Surv(time, status) ~ as.factor(edema) + age +
## albumin + bili + protime, data = pbc)
##
## n= 391, number of events= 160
## (2 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## as.factor(edema)0.5 0.16133 1.17507 0.23181 0.70 0.48645
## as.factor(edema)1 1.04124 2.83274 0.29616 3.52 0.00044 ***
## age 0.03697 1.03766 0.00819 4.52 6.3e-06 ***
## albumin -1.00671 0.36542 0.20575 -4.89 9.9e-07 ***
## bili 0.12139 1.12907 0.01348 9.00 < 2e-16 ***
## protime 0.17767 1.19443 0.05827 3.05 0.00230 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## as.factor(edema)0.5 1.175 0.851 0.746 1.851
## as.factor(edema)1 2.833 0.353 1.585 5.062
## age 1.038 0.964 1.021 1.054
## albumin 0.365 2.737 0.244 0.547
## bili 1.129 0.886 1.100 1.159
## protime 1.194 0.837 1.066 1.339
##
## Concordance= 0.81 (se = 0.025 )
## Rsquare= 0.371 (max possible= 0.988 )
## Likelihood ratio test= 181 on 6 df, p=0
## Wald test = 216 on 6 df, p=0
## Score (logrank) test = 327 on 6 df, p=0
It is of course also possible to transform the covariates before using them in the regression. For example,
pbcSurv <- coxph(Surv(time, status) ~ as.factor(edema) + age + log(bili) + log(albumin) +
log(protime), data = pbc)
summary(pbcSurv)
## Call:
## coxph(formula = Surv(time, status) ~ as.factor(edema) + age +
## log(bili) + log(albumin) + log(protime), data = pbc)
##
## n= 391, number of events= 160
## (2 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## as.factor(edema)0.5 0.25301 1.28789 0.22599 1.12 0.2629
## as.factor(edema)1 0.94466 2.57194 0.28954 3.26 0.0011 **
## age 0.03680 1.03749 0.00773 4.76 1.9e-06 ***
## log(bili) 0.88805 2.43039 0.08471 10.48 < 2e-16 ***
## log(albumin) -2.61669 0.07304 0.65204 -4.01 6.0e-05 ***
## log(protime) 2.01116 7.47196 0.79429 2.53 0.0113 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## as.factor(edema)0.5 1.288 0.776 0.8270 2.006
## as.factor(edema)1 2.572 0.389 1.4582 4.536
## age 1.037 0.964 1.0219 1.053
## log(bili) 2.430 0.411 2.0586 2.869
## log(albumin) 0.073 13.690 0.0204 0.262
## log(protime) 7.472 0.134 1.5752 35.444
##
## Concordance= 0.834 (se = 0.025 )
## Rsquare= 0.445 (max possible= 0.988 )
## Likelihood ratio test= 230 on 6 df, p=0
## Wald test = 234 on 6 df, p=0
## Score (logrank) test = 320 on 6 df, p=0
One also needs tools for model diagnostics. One simple idea is as follows. The survival functions \( S_i(t) \) for different individuals \( i \) must satisfy \[ \log(-\log(S_i(t))) = \log \int_0^t \lambda_0(s) \, \mathrm{d} s + X_i^T\beta, \] so, as functions of \( t \), they are parallel: they differ only by the constant \( X_i^T\beta \). We do not know each individual survival function, but if we group individuals with similar covariates we may be able to estimate the survival functions for the groups non-parametrically and then inspect visually whether they look parallel on the \( \log(-\log) \) scale.
Considering only one variable, bili, we divide the observations into three groups.
group <- 1 * (pbc$bili > 1.1) + 1 * (pbc$bili > 3.3)
cfit <- survfit(Surv(time, status) ~ group, data = pbc)
plot(cfit, mark.time = FALSE, fun = "cloglog", xlim = c(40, 5000))
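The same grouping can be written more transparently with cut; the break points 1.1 and 3.3 are those used above (the setup lines are repeated so the fragment stands alone):

```r
library(survival)
data(pbc)
pbc <- subset(pbc, status != 1)
group  <- 1 * (pbc$bili > 1.1) + 1 * (pbc$bili > 3.3)           # as above
group2 <- cut(pbc$bili, breaks = c(-Inf, 1.1, 3.3, Inf),
              labels = FALSE) - 1                                # integer codes 0, 1, 2
table(group, group2)  # the two codings agree
```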
Including in addition edema, which is already discrete, we get a total of 9 groups.
Note the additive specification in the formula below, which in reality stratifies
the estimation into all 9 possible combinations.
cfit <- survfit(Surv(time, status) ~ group + edema, data = pbc)
plot(cfit, mark.time = FALSE, fun = "cloglog", xlim = c(40, 5000), col = rainbow(9))
It is left to the reader to judge the plot …
A good question is what happens if further stratification according to other variables is applied. The above method can be found on pages 43-44 in Therneau and Grambsch. It may not be particularly sensible if there are other significant covariates besides those used for the stratification.
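Therneau and Grambsch also develop a formal check of the proportional hazards assumption based on scaled Schoenfeld residuals, implemented as cox.zph in the survival package. A sketch, with an illustrative choice of covariates:

```r
library(survival)
data(pbc)
pbc <- subset(pbc, status != 1)
pbc <- transform(pbc, status = as.logical(status))
fit <- coxph(Surv(time, status) ~ as.factor(edema) + age + log(bili), data = pbc)
zp <- cox.zph(fit)  # score tests for a time-varying coefficient, per term and globally
print(zp)
plot(zp)            # smoothed scaled Schoenfeld residuals against time
```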
If we look at the formula above, the “correct” way to proceed is to stratify according to one covariate, estimate the Cox model in each stratum based on the other covariates, and then compare the non-parametric estimates of the logarithm of the integrated baseline intensities, that is, the first term in the formula for \( \log(-\log(S_i(t))) \). This is described in detail in Andersen, Borgan, Gill and Keiding, VII.3, pages 539-545.
The procedure above requires that one can estimate the integrated baseline intensity non-parametrically, and here we cannot fall back on the Kaplan-Meier estimator. It is, however, not a problem to construct such an estimator (the Breslow estimator), but we will not pursue the details here; see Andersen et al.
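In practice, the survival package already provides such an estimator: with strata in a coxph formula, basehaz returns a Breslow-type estimate of the cumulative baseline hazard per stratum, whose logarithms can then be compared visually. A sketch, with an illustrative choice of covariates:

```r
library(survival)
data(pbc)
pbc <- subset(pbc, status != 1)
pbc <- transform(pbc, status = as.logical(status))
## common beta for age and log(bili), separate baseline hazard in each edema stratum
sfit <- coxph(Surv(time, status) ~ age + log(bili) + strata(edema), data = pbc)
bh <- basehaz(sfit, centered = FALSE)  # cumulative baseline hazard per stratum
with(bh, plot(time, log(hazard), col = as.integer(strata), pch = 20,
              xlab = "time", ylab = "log cumulative baseline hazard"))
```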