We reconsider a data set from Chapter 20 of Frank Harrell's book “Regression Modeling Strategies”. It comes from a randomized clinical trial on prostate cancer patients with four different treatments (one placebo and three different dosages of estrogen). To assess differences between the treatment groups in the survival distribution we do not need to take other variables into account: the purpose of the randomization is to make treatment independent of any other predictor variable, thus allowing us to ascribe differences between the treatment groups to the treatment alone.
However, if we want to predict the survival distribution of individual patients (or subpopulations), we had better take other variables into account.
First, the data set is acquired.
con <- url("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/prostate.sav")
print(load(con)) ## Prints "prostate" if data loads correctly.
## [1] "prostate"
close(con)
Then we load the packages that are needed.
library(ggplot2) ### Extensive graphics package
library(reshape2) ### For the 'melt' function
library(lattice) ### More graphics (specifically, scatter plot matrices)
library(hexbin) ### and more graphics
library(survival)
library(xtable) ### for nice html tables
The data set contains the following variables with the following class labels:
names(prostate)
## [1] "patno" "stage" "rx" "dtime" "status" "age" "wt"
## [8] "pf" "hx" "sbp" "dbp" "ekg" "hg" "sz"
## [15] "sg" "ap" "bm" "sdate"
sapply(prostate, function(x) class(x)[1])
## patno stage rx dtime status age
## "labelled" "labelled" "factor" "labelled" "factor" "labelled"
## wt pf hx sbp dbp ekg
## "labelled" "factor" "labelled" "labelled" "labelled" "factor"
## hg sz sg ap bm sdate
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
Harrell has attached labels to the columns as attributes, which we can use to get more informative variable names.
sapply(prostate, function(x) {
if(is.factor(x)) {
attr(x, "levels")
} else {
attr(x, "label")
}
}
)
## $patno
## [1] "Patient Number"
##
## $stage
## [1] "Stage"
##
## $rx
## [1] "placebo" "0.2 mg estrogen" "1.0 mg estrogen" "5.0 mg estrogen"
##
## $dtime
## [1] "Months of Follow-up"
##
## $status
## [1] "alive" "dead - prostatic ca"
## [3] "dead - heart or vascular" "dead - cerebrovascular"
## [5] "dead - pulmonary embolus" "dead - other ca"
## [7] "dead - respiratory disease" "dead - other specific non-ca"
## [9] "dead - unspecified non-ca" "dead - unknown cause"
##
## $age
## [1] "Age in Years"
##
## $wt
## [1] "Weight Index = wt(kg)-ht(cm)+200"
##
## $pf
## [1] "normal activity" "in bed < 50% daytime" "in bed > 50% daytime"
## [4] "confined to bed"
##
## $hx
## [1] "History of Cardiovascular Disease"
##
## $sbp
## [1] "Systolic Blood Pressure/10"
##
## $dbp
## [1] "Diastolic Blood Pressure/10"
##
## $ekg
## [1] "normal" "benign"
## [3] "rhythmic disturb & electrolyte ch" "heart block or conduction def"
## [5] "heart strain" "old MI"
## [7] "recent MI"
##
## $hg
## [1] "Serum Hemoglobin (g/100ml)"
##
## $sz
## [1] "Size of Primary Tumor (cm^2)"
##
## $sg
## [1] "Combined Index of Stage and Hist. Grade"
##
## $ap
## [1] "Serum Prostatic Acid Phosphatase"
##
## $bm
## [1] "Bone Metastases"
##
## $sdate
## [1] "Date on study"
Then we look at the marginal distributions of the variables in terms of histograms and barplots.
To study the continuous variables we exclude the patient number, the factors, the date, and some numeric variables that are, in fact, dichotomous variables encoded as numerics. Then we make histograms. Subsequently, to study the discrete variables, we consider barplots.
conVar <- c("dtime", "age", "wt", "sbp", "dbp","hg", "sz", "sg", "ap")
disVar <- c( "stage", "rx", "status", "pf", "hx", "ekg", "bm")
meltedProstate <- melt(prostate[, conVar])
qplot(value, data = meltedProstate, geom = "histogram",
xlab = "", ylab = "") +
facet_wrap(~ variable, scales = "free", ncol = 2)
meltedProstate <- melt(prostate[, disVar], id.vars = c())
qplot(value, data = meltedProstate, geom = "bar",
xlab = "", ylab = "") +
facet_wrap(~ variable, scales = "free", ncol = 3) +
theme(axis.text.x = element_text(angle = -30,
size = 8, hjust = 0, vjust = 1))
Some noteworthy observations are that ap has a very skewed distribution,
which suggests that a log-transform is beneficial, and that pf has a very asymmetric
distribution, with only a small number of patients having a nonnormal activity level.
We should also observe that 148 of the 502 individuals,
that is, roughly 30%, have censored survival times (they are still alive).
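This count can be read off directly from the status variable (a small check, assuming the data has been loaded as above):

```r
## Tabulate censored (alive) versus dead individuals.
table(prostate$status == "alive")
## According to the text, 148 of the 502 individuals are censored.
```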
We also consider the scatter plot matrix of the continuous variables.
splom(na.omit(prostate[, conVar]),
upper.panel = panel.hexbinplot,
pscale = 0,
varname.cex = 0.7,
nbins = 15,
lower.panel = function(x, y) {
panel.text(mean(range(x)), mean(range(y)),
round(cor(x, y), digits = 2),
cex = 0.7
)
}
)
What is most notable is that the two measures of blood pressure are (not surprisingly)
highly correlated. We let their mean enter into the model below. They are both also correlated
with weight, which in turn is correlated with hg. The index sg shows some correlation with
the size variable sz as well. Besides the correlation between the systolic and diastolic blood pressures,
there is no evidence of strong collinearity.
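The correlations discussed above can also be extracted numerically; a small sketch using complete cases for a few of the continuous variables:

```r
## Correlation matrix for selected continuous variables (complete cases only).
round(cor(na.omit(prostate[, c("sbp", "dbp", "wt", "hg")])), 2)
```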
We will for now just drop the missing observations (Harrell gives a treatment of imputation for this particular data set in his Chapter 8).
### The number of missing values for each variable.
sapply(prostate, function(x) sum(is.na(x)))
## patno stage rx dtime status age wt pf hx sbp
## 0 0 0 0 0 1 2 0 0 0
## dbp ekg hg sz sg ap bm sdate
## 0 8 0 5 11 0 0 0
### Dropping observations with missing variables,
### and other data cleaning / manipulations.
subProstate <- transform(na.omit(prostate),
patno = as.factor(patno),
dtime = as.numeric(dtime),
bp = (sbp + dbp) / 2, ## Average blood pressure.
logap = log(ap)
)
conVar <- c(conVar, "logap", "bp")
To focus on the treatment we turn to a Kaplan-Meier estimator for each of the four treatment groups.
prostateSurv <- survfit(Surv(dtime, status != "alive") ~ rx,
data = subProstate)
plot(prostateSurv, mark.time = FALSE, conf.int = FALSE,
col = c("red", "blue", "purple", "cyan"))
The plot suggests that individuals treated with 1.0 mg estrogen have a longer survival time than the other three groups. Adding confidence bands (which will in general clutter the figure) shows that the difference is borderline significant.
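The visual impression can be supplemented by a formal log-rank test via survdiff from the survival package, and pointwise confidence bands can be added to the plot with the conf.int argument; a sketch:

```r
## Log-rank test for equality of the four survival curves.
survdiff(Surv(dtime, status != "alive") ~ rx, data = subProstate)
## Redraw the Kaplan-Meier curves with pointwise confidence bands.
plot(prostateSurv, mark.time = FALSE, conf.int = TRUE,
     col = c("red", "blue", "purple", "cyan"))
```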
We fit Cox's proportional hazards model using the coxph function from
the survival package. Then we construct tests for each of the variables in
the full model.
form <- Surv(dtime, status != "alive") ~ rx + age + wt + pf +
hx + ekg + hg + sz + sg + logap + bp + bm
prostateCox <- coxph(form, data = subProstate)
testtab <- drop1(prostateCox, test = "Chisq")
ord <- order(testtab[, 4][-1]) + 1
print(xtable(testtab[c(1, ord), ]), type = "html")
|  | Df | AIC | LRT | Pr(> Chi) |
|---|---|---|---|---|
| &lt;none&gt; |  | 3732.35 |  |  |
| hx | 1 | 3746.87 | 16.52 | 0.0000 |
| sz | 1 | 3741.71 | 11.37 | 0.0007 |
| age | 1 | 3737.55 | 7.21 | 0.0073 |
| hg | 1 | 3734.51 | 4.16 | 0.0414 |
| sg | 1 | 3734.31 | 3.96 | 0.0465 |
| wt | 1 | 3734.18 | 3.83 | 0.0502 |
| ekg | 6 | 3732.05 | 11.70 | 0.0690 |
| rx | 3 | 3732.11 | 5.77 | 0.1235 |
| bm | 1 | 3731.93 | 1.58 | 0.2091 |
| pf | 3 | 3729.85 | 3.50 | 0.3210 |
| bp | 1 | 3730.42 | 0.08 | 0.7829 |
| logap | 1 | 3730.37 | 0.02 | 0.8744 |
The tests suggest that the difference between the four treatment groups can be
explained by other variables, in particular hx (history of cardiovascular disease),
tumor size, and age. In fact, these findings suggest that a more appropriate model is
one with competing risks, that is, a model where we model the risk of dying from
the cancer as well as of dying from a heart attack or a similar non-cancer related cause.
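One ingredient of such an analysis, the cause-specific hazard of death from prostate cancer, can be sketched by treating deaths from all other causes as censorings (this is only a sketch of one cause-specific model, not a complete competing risks analysis):

```r
## Cause-specific Cox model for death from prostate cancer;
## deaths from other causes are treated as censorings.
prostateCoxCa <- coxph(Surv(dtime, status == "dead - prostatic ca") ~ rx + age +
                         wt + pf + hx + ekg + hg + sz + sg + logap + bp + bm,
                       data = subProstate)
summary(prostateCoxCa)
```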
The profile likelihood argument for deriving the partial likelihood immediately gives a way to estimate the cumulative baseline hazard function.
w <- predict(prostateCox, type = "risk") ## Individual weights
orddtime <- order(subProstate$dtime)
stat <- (subProstate$status != "alive")[orddtime]
W <- rev(cumsum(w[rev(orddtime)]))
Lambda <- cumsum(stat / W)
plot(subProstate$dtime[orddtime], exp(-Lambda), type = "s",
ylim = c(0, 1), xlim = c(0, 76))
The computations can be carried out using the survfit function on the coxph
object as well, but it is illustrative to understand how the estimator arises from
the profile likelihood method.
plot(survfit(prostateCox))
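The agreement can also be checked against basehaz, which returns the Breslow estimate of the cumulative baseline hazard underlying survfit (a sketch; centered = TRUE uses the same covariate centering as the predict call above, so the two step functions should essentially coincide):

```r
## Breslow estimate of the cumulative baseline hazard from the coxph fit.
bh <- basehaz(prostateCox, centered = TRUE)
## Overlay on the manually computed survival curve for comparison.
plot(subProstate$dtime[orddtime], exp(-Lambda), type = "s", ylim = c(0, 1))
lines(bh$time, exp(-bh$hazard), type = "s", col = "red", lty = 2)
```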
The Cox-Snell residual plot can reveal overall lack of fit. It is based on the result that, under independent censoring, transforming the survival times by the (true) cumulative hazard function yields exponentially distributed variables (for continuous distributions).
The Cox-Snell residuals are simply the individual estimated cumulative hazards evaluated at the corresponding censored survival times. The Nelson-Aalen estimator of the cumulative hazard function for the Cox-Snell residuals, plotted against the residuals, should lie on a straight line.
CSres <- Lambda * w[orddtime] ## Cox-Snell residuals
ordres <- order(CSres)
CSres <- CSres[ordres]
statres <- stat[ordres] ## Indicator of non-censoring
Y <- seq(length(Lambda), 1)[statres]
## Nelson-Aalen estimator of exponential cumulative hazard
plot(CSres[statres], cumsum(1/Y), col = "red")
abline(0, 1)
The Nelson-Aalen estimator is computed “by hand” above. It can also be done using
the survfit function.
tmp <- survfit(Surv(CSres, statres) ~ 1, type = "fleming-harrington")
plot(CSres[statres], -log(tmp$surv)[statres])
abline(0, 1)
Note the use of the type argument. The "fleming-harrington" type gives an
estimate of the survival function based on the Nelson-Aalen estimator, so
that \(-\log \hat{S}\) is identical to the Nelson-Aalen estimator (this is
the way to compute the Nelson-Aalen estimator using the survival package).
The default type argument is "kaplan-meier", which of course gives the
Kaplan-Meier estimator of the survival function. In this case, \(-\log \hat{S}\)
will differ slightly from the Nelson-Aalen estimator.
The Cox-Snell residuals are mostly useful for an overall assessment of model fit, and not so useful for detecting how the model might fail to fit. In this respect they resemble the QQ-plot. (For linear models, a heterogeneous variance or a lack of fit in the mean value specification can easily show up as a deviation from Gaussianity in a QQ-plot.)
Martingale or deviance residuals are, like regression residuals, more useful for investigating specific aspects of the model fit, e.g. how the predictors enter into the model.
prostateData <- subProstate[, c("patno", all.vars(form)[-c(1, 2)])]
predProstate <- cbind(prostateData,
data.frame(mgres = residuals(prostateCox, type = "martingale"))
## Change the type argument to deviance to get deviance residuals
)
meltedPredProstate <- melt(predProstate, id.vars = c("patno", "mgres"),
measure.vars = conVar[conVar %in% all.vars(form)[-c(1, 2)]])
qplot(value, mgres, data = meltedPredProstate,
xlab = "", ylab = "", geom = "point", alpha = I(0.3), size = I(3)) +
facet_wrap(~ variable, scales = "free_x", ncol = 2) +
geom_smooth(size = 1, fill = "blue")
These plots do not suggest obvious misfits, but we attempt a model with nonlinear effects of all continuous predictors anyway.
library(splines) ### For the natural cubic spline basis 'ns'.
form <- Surv(dtime, status != "alive") ~ rx + ns(age, 4) + ns(wt, 4) + pf + hx +
ekg + ns(hg, 4) + ns(sz, 4) + ns(sg, 4) +
ns(logap, 4) + ns(bp, 4) + bm
prostateCox2 <- coxph(form, data = subProstate)
print(xtable(drop1(prostateCox2, test = "Chisq")[ord, ]), type = "html")
|  | Df | AIC | LRT | Pr(> Chi) |
|---|---|---|---|---|
| hx | 1 | 3752.90 | 16.71 | 0.0000 |
| ns(sz, 4) | 4 | 3744.19 | 14.00 | 0.0073 |
| ns(age, 4) | 4 | 3745.07 | 14.88 | 0.0050 |
| ns(hg, 4) | 4 | 3739.74 | 9.54 | 0.0489 |
| ns(sg, 4) | 4 | 3734.27 | 4.08 | 0.3958 |
| ns(wt, 4) | 4 | 3739.07 | 8.87 | 0.0644 |
| ekg | 6 | 3740.46 | 14.27 | 0.0268 |
| rx | 3 | 3736.51 | 4.31 | 0.2295 |
| bm | 1 | 3736.49 | 0.30 | 0.5857 |
| pf | 3 | 3738.87 | 6.68 | 0.0828 |
| ns(bp, 4) | 4 | 3733.10 | 2.91 | 0.5731 |
| ns(logap, 4) | 4 | 3737.88 | 7.68 | 0.1039 |
We can test the model with only linear effects against this larger model using the
anova method for coxph objects.
print(xtable(anova(prostateCox2, prostateCox)), type = "html")
|  | loglik | Chisq | Df | P(> \|Chi\|) |
|---|---|---|---|---|
| 1 | -1827.10 |  |  |  |
| 2 | -1845.17 | 36.15 | 21 | 0.0210 |
The test is borderline significant, showing that there is some, but not a lot of, evidence in the data for nonlinear relations.
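The comparison can also be phrased in terms of AIC, which for coxph objects is based on the partial likelihood (a quick sketch):

```r
## AIC comparison of the model with linear effects and the spline model.
AIC(prostateCox, prostateCox2)
```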
A natural question with the proportional hazards model is whether the proportional hazards assumption holds. This can be investigated by excluding a variable and visually inspecting the different baseline estimates for each level of the variable. This works well for factor variables with few levels.
To illustrate the method, we first return to the model where we only include treatment. But now we fit the proportional hazards model and consider the different estimated parameters for the treatments.
prostateCox0 <- coxph(Surv(dtime, status != "alive") ~ rx, data = subProstate)
print(xtable(coef(summary(prostateCox0))), type = "html")
|  | coef | exp(coef) | se(coef) | z | Pr(> \|z\|) |
|---|---|---|---|---|---|
| rx0.2 mg estrogen | 0.03 | 1.03 | 0.15 | 0.21 | 0.83 |
| rx1.0 mg estrogen | -0.37 | 0.69 | 0.16 | -2.29 | 0.02 |
| rx5.0 mg estrogen | 0.03 | 1.03 | 0.15 | 0.17 | 0.86 |
This shows, in concordance with the findings above, that the group given estrogen treatment at a dosage of 1.0 mg has a decreased risk of dying (the hazard rate is lowered), and that this effect is borderline significant. However, what about the proportional hazards assumption? We check it by checking whether the Nelson-Aalen estimates of the cumulative hazard functions within each of the four treatment groups are, in fact, proportional.
## The cloglog transform gives the log cumulative hazards plotted
## against the log times
plot(prostateSurv, mark.time = FALSE, conf.int = FALSE,
col = c("red", "blue", "purple", "cyan"), fun = "cloglog")
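A formal supplement to this visual check is cox.zph from the survival package, which tests the proportional hazards assumption using scaled Schoenfeld residuals (a sketch):

```r
## Score tests for proportional hazards: one per variable plus a global test.
cox.zph(prostateCox0)
## plot(cox.zph(prostateCox0)) displays the scaled Schoenfeld residuals
## against time; a nonconstant trend indicates nonproportional hazards.
```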
The proportional hazards assumption is not perfect: if it holds, the four curves should be roughly parallel. If we have a model with more predictors, we can investigate a single one of them using a similar technique. We fit the model using all other predictors and then nonparametrically estimate the baseline for each value of the factor. This is done below for the treatment variable.
prostateCox2a <- update(prostateCox2, . ~ . - rx)
w <- predict(prostateCox2a, type = "risk") ## Individual weights
orddtime <- order(subProstate$dtime)
stat <- (subProstate$status != "alive")[orddtime]
rxord <- subProstate$rx[orddtime]
W <- tapply(w[orddtime], rxord, function(ww) rev(cumsum(rev(ww))))
statstrat <- tapply(stat, rxord, function(x) x)
color <- c("red", "blue", "purple", "cyan")
Lambda1 <- cumsum(statstrat[[1]] / W[[1]])
plot(log(subProstate$dtime[orddtime][as.numeric(rxord) == 1]), log(Lambda1),
type = "s", col = color[1])
for(i in 2:4) {
Lambda1 <- cumsum(statstrat[[i]] / W[[i]])
lines(log(subProstate$dtime[orddtime][as.numeric(rxord) == i]), log(Lambda1),
type = "s", col = color[i])
}
The conclusion is the same (not surprisingly, because treatment is by randomization independent of the remaining predictors).
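The manual computation above can, alternatively, be carried out by fitting a stratified Cox model, where strata(rx) gives a separate nonparametric baseline per treatment group (a sketch, reusing the spline model formula):

```r
## Stratified Cox model: one baseline hazard per treatment group.
prostateCoxStrat <- coxph(update(form, . ~ . - rx + strata(rx)),
                          data = subProstate)
## Log cumulative hazards against log time, one curve per stratum.
plot(survfit(prostateCoxStrat), fun = "cloglog",
     col = c("red", "blue", "purple", "cyan"))
```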
As always, we should try to investigate and present the fitted model. Here we give curves of the fitted terms for each continuous variable, and fitted parameters with confidence bands for each parameter associated with a factor variable.
predictProstate <- predict(prostateCox2, type = "terms", se.fit = TRUE, reference = "strata")
predictData <- melt(subProstate[, c("patno", conVar[-c(1, 4, 5, 9)])],
id.vars = "patno")
selectedTerms <- which(all.vars(form)[- c(1, 2)] %in% conVar)
plotData <- cbind(predictData,
se = melt(predictProstate$se.fit[, selectedTerms])$value,
y = melt(predictProstate$fit[, selectedTerms])$value
)
ggplot(plotData, aes(x = value), xlab = "", ylab = "") +
geom_ribbon(aes(ymin = y - 2*se, ymax = y + 2*se),
alpha = I(0.2),
fill = I("blue")) +
geom_line(aes(y = y), size = 1) +
geom_rug(data = predictData,
aes(x = value, y = NULL),
colour = I("black"),
alpha = I(0.1)
) +
facet_wrap(~ variable, ncol = 2, scale = "free")
We note that age and sz (and perhaps also hg) show some nonlinearity.
predictData <- melt(subProstate[, c("patno", disVar[-c(1, 3)])],
id.vars = "patno")
selectedTerms <- which(all.vars(form)[-c(1, 2)] %in% disVar)
plotData <- cbind(predictData,
se = melt(predictProstate$se.fit[, selectedTerms])$value,
y = melt(predictProstate$fit[, selectedTerms])$value
)
## We subset to get just one prediction for each factor value. This is a
## little clumsy, and could probably be improved.
plotData <- plotData[!duplicated(interaction(predictData[, c("variable", "value")])), ]
ggplot(plotData, aes(x = value), xlab = "", ylab = "") +
geom_errorbar(aes(ymin = y - 2*se, ymax = y + 2*se),
colour = I("blue")) +
geom_point(aes(y = y), size = 3) +
facet_wrap(~ variable, ncol = 2, scale = "free") +
theme(axis.text.x = element_text(angle = -30,
size = 8, hjust = 0, vjust = 1))
Note in the plots above how the size of the confidence intervals depends on how many observations we have with the given factor value. Observe also that all predictions can only be interpreted relative to one another, not in absolute terms.