We reconsider a data set from Chapter 20 of Frank Harrell's book “Regression Modeling Strategies”. It comes from a randomized clinical trial on prostate cancer patients with four different treatments (one placebo and three different dosages of estrogen). To assess differences between the treatment groups in the survival distribution we do not need to take other variables into account: the purpose of the randomization is to make treatment independent of any other predictor variable, thus allowing us to ascribe differences between the treatment groups to the treatment alone.
However, if we want to predict the survival distribution of individual patients (or subpopulations), we had better take other variables into account.
First, the data set is acquired.
con <- url("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/prostate.sav")
print(load(con)) ## Prints "prostate" if data loads correctly.
## [1] "prostate"
close(con)
Then we load the packages that are needed.
library(ggplot2) ### Extensive graphics package
library(reshape2) ### For the 'melt' function
library(lattice) ### More graphics (specifically, scatter plot matrices)
library(hexbin) ### and more graphics
library(survival)
library(xtable) ### for nice html tables
The data set contains the following variables with the following class labels:
names(prostate)
## [1] "patno" "stage" "rx" "dtime" "status" "age" "wt"
## [8] "pf" "hx" "sbp" "dbp" "ekg" "hg" "sz"
## [15] "sg" "ap" "bm" "sdate"
sapply(prostate, function(x) class(x)[1])
## patno stage rx dtime status age
## "labelled" "labelled" "factor" "labelled" "factor" "labelled"
## wt pf hx sbp dbp ekg
## "labelled" "factor" "labelled" "labelled" "labelled" "factor"
## hg sz sg ap bm sdate
## "labelled" "labelled" "labelled" "labelled" "labelled" "labelled"
Harrell has attached labels to the columns as attributes, which we can use to get more informative variable names.
sapply(prostate, function(x) {
if(is.factor(x)) {
attr(x, "levels")
} else {
attr(x, "label")
}
}
)
## $patno
## [1] "Patient Number"
##
## $stage
## [1] "Stage"
##
## $rx
## [1] "placebo" "0.2 mg estrogen" "1.0 mg estrogen" "5.0 mg estrogen"
##
## $dtime
## [1] "Months of Follow-up"
##
## $status
## [1] "alive" "dead - prostatic ca"
## [3] "dead - heart or vascular" "dead - cerebrovascular"
## [5] "dead - pulmonary embolus" "dead - other ca"
## [7] "dead - respiratory disease" "dead - other specific non-ca"
## [9] "dead - unspecified non-ca" "dead - unknown cause"
##
## $age
## [1] "Age in Years"
##
## $wt
## [1] "Weight Index = wt(kg)-ht(cm)+200"
##
## $pf
## [1] "normal activity" "in bed < 50% daytime" "in bed > 50% daytime"
## [4] "confined to bed"
##
## $hx
## [1] "History of Cardiovascular Disease"
##
## $sbp
## [1] "Systolic Blood Pressure/10"
##
## $dbp
## [1] "Diastolic Blood Pressure/10"
##
## $ekg
## [1] "normal" "benign"
## [3] "rhythmic disturb & electrolyte ch" "heart block or conduction def"
## [5] "heart strain" "old MI"
## [7] "recent MI"
##
## $hg
## [1] "Serum Hemoglobin (g/100ml)"
##
## $sz
## [1] "Size of Primary Tumor (cm^2)"
##
## $sg
## [1] "Combined Index of Stage and Hist. Grade"
##
## $ap
## [1] "Serum Prostatic Acid Phosphatase"
##
## $bm
## [1] "Bone Metastases"
##
## $sdate
## [1] "Date on study"
Then we look at the marginal distributions of the variables in terms of histograms and barplots.
To study the continuous variables we exclude the patient number, the factors, the date, and some numeric variables that are, in fact, dichotomous variables encoded as numerics. Then we make histograms. Subsequently, to study the discrete variables, we consider barplots.
conVar <- c("dtime", "age", "wt", "sbp", "dbp","hg", "sz", "sg", "ap")
disVar <- c( "stage", "rx", "status", "pf", "hx", "ekg", "bm")
meltedProstate <- melt(prostate[, conVar])
qplot(value, data = meltedProstate, geom = "histogram",
xlab = "", ylab = "") +
facet_wrap(~ variable, scales = "free", ncol = 2)
meltedProstate <- melt(prostate[, disVar], id.vars = c())
qplot(value, data = meltedProstate, geom = "bar",
xlab = "", ylab = "") +
facet_wrap(~ variable, scales = "free", ncol = 3) +
theme(axis.text.x = element_text(angle = -30,
size = 8, hjust = 0, vjust = 1))
Some noteworthy observations are that ap has a very skewed distribution,
which suggests that a log-transform is beneficial, and that pf has a very asymmetric
distribution, with only a small number of patients having a nonnormal activity level.
We should also observe that 148 of the 502 individuals,
that is, roughly 30%, have censored survival times (they are still alive).
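This count can be read off directly from the status variable (a small check, assuming the data has been loaded as above):

```r
## Tabulate censored (alive) versus dead individuals.
table(prostate$status == "alive")
## According to the text, 148 of the 502 individuals are censored.
```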
We also consider the scatter plot matrix of the continuous variables.
splom(na.omit(prostate[, conVar]),
upper.panel = panel.hexbinplot,
pscale = 0,
varname.cex = 0.7,
nbins = 15,
lower.panel = function(x, y) {
panel.text(mean(range(x)), mean(range(y)),
round(cor(x, y), digits = 2),
cex = 0.7
)
}
)
What is most notable is that the two measures of blood pressure are (not surprisingly)
highly correlated. We let their mean enter into the model below. They are both also correlated
with weight, which in turn is correlated with hg. The index sg shows some correlation with
the size variable sz as well. Besides the correlation between the systolic and diastolic blood pressures,
there is no evidence of strong collinearity.
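The correlations discussed above can also be extracted numerically; a small sketch using complete cases for a few of the continuous variables:

```r
## Correlation matrix for selected continuous variables (complete cases only).
round(cor(na.omit(prostate[, c("sbp", "dbp", "wt", "hg")])), 2)
```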
We will for now just drop the missing observations (Harrell gives a treatment of imputation for this particular data set in his Chapter 8).
### The number of missing values for each variable.
sapply(prostate, function(x) sum(is.na(x)))
## patno stage rx dtime status age wt pf hx sbp
## 0 0 0 0 0 1 2 0 0 0
## dbp ekg hg sz sg ap bm sdate
## 0 8 0 5 11 0 0 0
### Dropping observations with missing variables,
### and other data cleaning / manipulations.
subProstate <- transform(na.omit(prostate),
patno = as.factor(patno),
dtime = as.numeric(dtime),
bp = (sbp + dbp) / 2, ## Average blood pressure.
logap = log(ap)
)
conVar <- c(conVar, "logap", "bp")
To focus on the treatment we turn to a Kaplan-Meier estimator for each of the four treatment groups.
prostateSurv <- survfit(Surv(dtime, status != "alive") ~ rx,
data = subProstate)
plot(prostateSurv, mark.time = FALSE, conf.int = FALSE,
col = c("red", "blue", "purple", "cyan"))
The plot suggests that individuals treated with 1.0 mg estrogen have a longer survival time than the other three groups. Adding confidence bands (which will in general clutter the figure) shows that the difference is borderline significant.
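The visual impression can be supplemented by a formal log-rank test via survdiff from the survival package, and pointwise confidence bands can be added to the plot with the conf.int argument; a sketch:

```r
## Log-rank test for equality of the four survival curves.
survdiff(Surv(dtime, status != "alive") ~ rx, data = subProstate)
## Redraw the Kaplan-Meier curves with pointwise confidence bands.
plot(prostateSurv, mark.time = FALSE, conf.int = TRUE,
     col = c("red", "blue", "purple", "cyan"))
```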
We fit Cox's proportional hazards model using the coxph function from
the survival package. Then we construct tests for each of the variables in
the full model.
form <- Surv(dtime, status != "alive") ~ rx + age + wt + pf +
hx + ekg + hg + sz + sg + logap + bp + bm
prostateCox <- coxph(form, data = subProstate)
testtab <- drop1(prostateCox, test = "Chisq")
ord <- order(testtab[, 4][-1]) + 1
print(xtable(testtab[c(1, ord), ]), type = "html")
|  | Df | AIC | LRT | Pr(> Chi) |
|---|---|---|---|---|
| &lt;none&gt; |  | 3732.35 |  |  |
| hx | 1 | 3746.87 | 16.52 | 0.0000 |
| sz | 1 | 3741.71 | 11.37 | 0.0007 |
| age | 1 | 3737.55 | 7.21 | 0.0073 |
| hg | 1 | 3734.51 | 4.16 | 0.0414 |
| sg | 1 | 3734.31 | 3.96 | 0.0465 |
| wt | 1 | 3734.18 | 3.83 | 0.0502 |
| ekg | 6 | 3732.05 | 11.70 | 0.0690 |
| rx | 3 | 3732.11 | 5.77 | 0.1235 |
| bm | 1 | 3731.93 | 1.58 | 0.2091 |
| pf | 3 | 3729.85 | 3.50 | 0.3210 |
| bp | 1 | 3730.42 | 0.08 | 0.7829 |
| logap | 1 | 3730.37 | 0.02 | 0.8744 |
The tests suggest that the difference between the four treatment groups can be
explained by other variables, in particular hx (history of cardiovascular disease),
tumor size, and age. In fact, these findings suggest that a more appropriate model is
one with competing risks, that is, a model where we model the risk of dying from
the cancer as well as of dying from a heart attack or a similar non-cancer related cause.
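One ingredient of such an analysis, the cause-specific hazard of death from prostate cancer, can be sketched by treating deaths from all other causes as censorings (this is only a sketch of one cause-specific model, not a complete competing risks analysis):

```r
## Cause-specific Cox model for death from prostate cancer;
## deaths from other causes are treated as censorings.
prostateCoxCa <- coxph(Surv(dtime, status == "dead - prostatic ca") ~ rx + age +
                         wt + pf + hx + ekg + hg + sz + sg + logap + bp + bm,
                       data = subProstate)
summary(prostateCoxCa)
```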
The profile likelihood argument for deriving the partial likelihood immediately gives a way to estimate the cumulative baseline hazard function.
w <- predict(prostateCox, type = "risk") ## Individual weights
orddtime <- order(subProstate$dtime)
stat <- (subProstate$status != "alive")[orddtime]
W <- rev(cumsum(w[rev(orddtime)]))
Lambda <- cumsum(stat / W)
plot(subProstate$dtime[orddtime], exp(-Lambda), type = "s",
ylim = c(0, 1), xlim = c(0, 76))
The computations can be carried out using the survfit function on the coxph
object as well, but it is illustrative to understand how the estimator arises from
the profile likelihood method.
plot(survfit(prostateCox))
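The agreement can also be checked against basehaz, which returns the Breslow estimate of the cumulative baseline hazard underlying survfit (a sketch; centered = TRUE uses the same covariate centering as the predict call above, so the two step functions should essentially coincide):

```r
## Breslow estimate of the cumulative baseline hazard from the coxph fit.
bh <- basehaz(prostateCox, centered = TRUE)
## Overlay on the manually computed survival curve for comparison.
plot(subProstate$dtime[orddtime], exp(-Lambda), type = "s", ylim = c(0, 1))
lines(bh$time, exp(-bh$hazard), type = "s", col = "red", lty = 2)
```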
The Cox-Snell residual plot can reveal overall lack of fit. It is based on the result that, under independent censoring, transforming the survival times by the (true) cumulative hazard function yields exponentially distributed variables (for continuous distributions).
The Cox-Snell residuals are simply the individual estimated cumulative hazards evaluated at the corresponding censored survival times. The Nelson-Aalen estimator of the cumulative hazard function for the Cox-Snell residuals, plotted against the residuals, should lie on a straight line.
CSres <- Lambda * w[orddtime] ## Cox-Snell residuals
ordres <- order(CSres)
CSres <- CSres[ordres]
statres <- stat[ordres] ## Indicator of non-censoring
Y <- seq(length(Lambda), 1)[statres]
## Nelson-Aalen estimator of exponential cumulative hazard
plot(CSres[statres], cumsum(1/Y), col = "red")
abline(0, 1)
The Nelson-Aalen estimator is computed “by hand” above. It can also be done using
the survfit function.
tmp <- survfit(Surv(CSres, statres) ~ 1, type = "fleming-harrington")
plot(CSres[statres], -log(tmp$surv)[statres])
abline(0, 1)
Note the use of the type argument. The "fleming-harrington" type gives an
estimate of the survival function based on the Nelson-Aalen estimator, so
that \(-\log \hat{S}\) is identical to the Nelson-Aalen estimator (this is
the way to compute the Nelson-Aalen estimator using the survival package).
The default type argument is "kaplan-meier", which of course gives the
Kaplan-Meier estimator of the survival function. In this case, \(-\log \hat{S}\)
will differ slightly from the Nelson-Aalen estimator.
The Cox-Snell residuals are mostly useful for an overall assessment of model fit, and not so useful for detecting how the model might fail to fit. In this respect they resemble the QQ-plot. (For linear models, a heterogeneous variance or a lack of fit in the mean value specification can easily show up as a deviation from Gaussianity in a QQ-plot.)
Martingale or deviance residuals are, like regression residuals, more useful for investigating specific aspects of the model fit, e.g. how the predictors enter into the model.
prostateData <- subProstate[, c("patno", all.vars(form)[-c(1, 2)])]
predProstate <- cbind(prostateData,
data.frame(mgres = residuals(prostateCox, type = "martingale"))
## Change the type argument to deviance to get deviance residuals
)
meltedPredProstate <- melt(predProstate, id.vars = c("patno", "mgres"),
measure.vars = conVar[conVar %in% all.vars(form)[-c(1, 2)]])
qplot(value, mgres, data = meltedPredProstate,
xlab = "", ylab = "", geom = "point", alpha = I(0.3), size = I(3)) +
facet_wrap(~ variable, scales = "free_x", ncol = 2) +
geom_smooth(size = 1, fill = "blue")
These plots do not suggest obvious misfits, but we attempt a model with nonlinear effects of all continuous predictors anyway.
library(splines) ### For the natural cubic spline basis 'ns'.
form <- Surv(dtime, status != "alive") ~ rx + ns(age, 4) + ns(wt, 4) + pf + hx +
ekg + ns(hg, 4) + ns(sz, 4) + ns(sg, 4) +
ns(logap, 4) + ns(bp, 4) + bm
prostateCox2 <- coxph(form, data = subProstate)
print(xtable(drop1(prostateCox2, test = "Chisq")[ord, ]), type = "html")
|  | Df | AIC | LRT | Pr(> Chi) |
|---|---|---|---|---|
| hx | 1 | 3752.90 | 16.71 | 0.0000 |
| ns(sz, 4) | 4 | 3744.19 | 14.00 | 0.0073 |
| ns(age, 4) | 4 | 3745.07 | 14.88 | 0.0050 |
| ns(hg, 4) | 4 | 3739.74 | 9.54 | 0.0489 |
| ns(sg, 4) | 4 | 3734.27 | 4.08 | 0.3958 |
| ns(wt, 4) | 4 | 3739.07 | 8.87 | 0.0644 |
| ekg | 6 | 3740.46 | 14.27 | 0.0268 |
| rx | 3 | 3736.51 | 4.31 | 0.2295 |
| bm | 1 | 3736.49 | 0.30 | 0.5857 |
| pf | 3 | 3738.87 | 6.68 | 0.0828 |
| ns(bp, 4) | 4 | 3733.10 | 2.91 | 0.5731 |
| ns(logap, 4) | 4 | 3737.88 | 7.68 | 0.1039 |
We can test the model with only linear effects against this larger model using the
anova method for coxph objects.
print(xtable(anova(prostateCox2, prostateCox)), type = "html")
|  | loglik | Chisq | Df | P(> \|Chi\|) |
|---|---|---|---|---|
| 1 | -1827.10 |  |  |  |
| 2 | -1845.17 | 36.15 | 21 | 0.0210 |
The test is borderline significant, showing that there is some, but not a lot of, evidence in the data for nonlinear relations.
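The comparison can also be phrased in terms of AIC, which for coxph objects is based on the partial likelihood (a quick sketch):

```r
## AIC comparison of the model with linear effects and the spline model.
AIC(prostateCox, prostateCox2)
```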
A natural question with the proportional hazards model is whether the proportional hazards assumption holds. This can be investigated by excluding a variable and visually inspecting the different baseline estimates for each level of the variable. This works well for factor variables with few levels.
To illustrate the method, we first return to the model where we only include treatment. But now we fit the proportional hazards model and consider the different estimated parameters for the treatments.
prostateCox0 <- coxph(Surv(dtime, status != "alive") ~ rx, data = subProstate)
print(xtable(coef(summary(prostateCox0))), type = "html")
|  | coef | exp(coef) | se(coef) | z | Pr(> \|z\|) |
|---|---|---|---|---|---|
| rx0.2 mg estrogen | 0.03 | 1.03 | 0.15 | 0.21 | 0.83 |
| rx1.0 mg estrogen | -0.37 | 0.69 | 0.16 | -2.29 | 0.02 |
| rx5.0 mg estrogen | 0.03 | 1.03 | 0.15 | 0.17 | 0.86 |
This shows, in concordance with the findings above, that the group given estrogen treatment at a dosage of 1.0 mg has a decreased risk of dying (the hazard rate is lowered), and that this effect is borderline significant. However, what about the proportional hazards assumption? We check it by checking whether the Nelson-Aalen estimates of the cumulative hazard functions within each of the four treatment groups are, in fact, proportional.
## The cloglog transform gives the log cumulative hazards plotted
## against the log times
plot(prostateSurv, mark.time = FALSE, conf.int = FALSE,
col = c("red", "blue", "purple", "cyan"), fun = "cloglog")
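A formal supplement to this visual check is cox.zph from the survival package, which tests the proportional hazards assumption using scaled Schoenfeld residuals (a sketch):

```r
## Score tests for proportional hazards: one per variable plus a global test.
cox.zph(prostateCox0)
## plot(cox.zph(prostateCox0)) displays the scaled Schoenfeld residuals
## against time; a nonconstant trend indicates nonproportional hazards.
```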
The proportional hazards assumption is not perfect: if it holds, the four curves should be roughly parallel. If we have a model with more predictors, we can investigate a single one of them using a similar technique. We fit the model using all other predictors and then nonparametrically estimate the baseline for each value of the factor. This is done below for the treatment variable.
prostateCox2a <- update(prostateCox2, . ~ . - rx)
w <- predict(prostateCox2a, type = "risk") ## Individual weights
orddtime <- order(subProstate$dtime)
stat <- (subProstate$status != "alive")[orddtime]
rxord <- subProstate$rx[orddtime]
W <- tapply(w[orddtime], rxord, function(ww) rev(cumsum(rev(ww))))
statstrat <- tapply(stat, rxord, function(x) x)
color <- c("red", "blue", "purple", "cyan")
Lambda1 <- cumsum(statstrat[[1]] / W[[1]])
plot(log(subProstate$dtime[orddtime][as.numeric(rxord) == 1]), log(Lambda1),
type = "s", col = color[1])
for(i in 2:4) {
Lambda1 <- cumsum(statstrat[[i]] / W[[i]])
lines(log(subProstate$dtime[orddtime][as.numeric(rxord) == i]), log(Lambda1),
type = "s", col = color[i])
}
The conclusion is the same (not surprisingly, because treatment is by randomization independent of the remaining predictors).
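The manual computation above can, alternatively, be carried out by fitting a stratified Cox model, where strata(rx) gives a separate nonparametric baseline per treatment group (a sketch, reusing the spline model formula):

```r
## Stratified Cox model: one baseline hazard per treatment group.
prostateCoxStrat <- coxph(update(form, . ~ . - rx + strata(rx)),
                          data = subProstate)
## Log cumulative hazards against log time, one curve per stratum.
plot(survfit(prostateCoxStrat), fun = "cloglog",
     col = c("red", "blue", "purple", "cyan"))
```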
As always, we should try to investigate and present the fitted model. Here we give curves of the fitted terms for each continuous variable, and fitted parameters with confidence bands for each parameter associated with a factor variable.
predictProstate <- predict(prostateCox2, type = "terms", se.fit = TRUE, reference = "strata")
predictData <- melt(subProstate[, c("patno", conVar[-c(1, 4, 5, 9)])],
id.vars = "patno")
selectedTerms <- which(all.vars(form)[- c(1, 2)] %in% conVar)
plotData <- cbind(predictData,
se = melt(predictProstate$se.fit[, selectedTerms])$value,
y = melt(predictProstate$fit[, selectedTerms])$value
)
ggplot(plotData, aes(x = value), xlab = "", ylab = "") +
geom_ribbon(aes(ymin = y - 2*se, ymax = y + 2*se),
alpha = I(0.2),
fill = I("blue")) +
geom_line(aes(y = y), size = 1) +
geom_rug(data = predictData,
aes(x = value, y = NULL),
colour = I("black"),
alpha = I(0.1)
) +
facet_wrap(~ variable, ncol = 2, scale = "free")
We note that age and sz (and perhaps also hg) show some nonlinearity.
predictData <- melt(subProstate[, c("patno", disVar[-c(1, 3)])],
id.vars = "patno")
selectedTerms <- which(all.vars(form)[-c(1, 2)] %in% disVar)
plotData <- cbind(predictData,
se = melt(predictProstate$se.fit[, selectedTerms])$value,
y = melt(predictProstate$fit[, selectedTerms])$value
)
## We subset to get just one prediction for each factor value. This is a
## little clumsy, and could probably be improved.
plotData <- plotData[!duplicated(interaction(predictData[, c("variable", "value")])), ]
ggplot(plotData, aes(x = value), xlab = "", ylab = "") +
geom_errorbar(aes(ymin = y - 2*se, ymax = y + 2*se),
colour = I("blue")) +
geom_point(aes(y = y), size = 3) +
facet_wrap(~ variable, ncol = 2, scale = "free") +
theme(axis.text.x = element_text(angle = -30,
size = 8, hjust = 0, vjust = 1))
Note in the plots above how the size of the confidence intervals depends on how many observations we have with the given factor value. Observe also that all predictions can only be interpreted relative to one another, not in absolute terms.