We use the vegetables data set to illustrate computation of confidence intervals and the use of bootstrapping. First, the data set is loaded and cleaned as in the lecture notes.
vegetables <- read.table("http://www.math.ku.dk/~richard/regression/data/vegetablesSale.txt",
    header = TRUE, colClasses = c("numeric", "numeric", "factor", "factor",
        "numeric", "numeric", "numeric"))
vegetables <- subset(vegetables, week != 2)
naid <- is.na(vegetables$discount)
impute <- with(subset(vegetables, !is.na(discount) & week == 4),
    c(1, median(discount), median(discountSEK)))
vegetables[naid, c("ad", "discount", "discountSEK")] <- impute
vegetables <- within(vegetables, {
    meanNormSale <- sort(tapply(normalSale, store, mean))
    store <- factor(store, levels = names(meanNormSale))
    meanNormSale <- meanNormSale[store]
})
Then we fit the Poisson model from the notes and compute confidence intervals for the two regression parameters associated with the variables ad and discount. We first compute standard confidence intervals based on estimates of the standard errors; this is achieved using the confint.default function. Then we compute likelihood intervals based on profile likelihood functions; this is achieved by calling the generic confint function, which for glm-objects invokes a method that produces profile intervals.
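To see what the two functions do in isolation, here is a minimal self-contained sketch on simulated data; the variables x and y are hypothetical and not part of the vegetables data.

```r
# Minimal sketch on simulated data: Wald-type intervals (confint.default)
# versus profile likelihood intervals (confint) for a Poisson GLM.
set.seed(1)
x <- runif(100)
y <- rpois(100, lambda = exp(1 + 2 * x))
fit <- glm(y ~ x, family = poisson)
ciWald <- confint.default(fit)  # estimate +/- 1.96 * estimated standard error
ciProf <- confint(fit)          # based on profiling the log-likelihood
# For a well-behaved fit like this the two sets of intervals nearly coincide.
```

The Wald intervals are symmetric around the estimate by construction, while the profile intervals need not be.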
form <- sale ~ offset(log(normalSale)) + store + ad + discount - 1
vegetablesGlm <- glm(form, family = poisson, data = vegetables)
confint.default(vegetablesGlm, c("ad1", "discount"))
## 2.5 % 97.5 %
## ad1 0.42229 0.51620
## discount 0.03955 0.04856
confint(vegetablesGlm, c("ad1", "discount"))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## ad1 0.42230 0.51621
## discount 0.03956 0.04857
The standard intervals and the likelihood intervals are in this case almost identical.
We then turn to non-parametric bootstrapping using pair sampling. The implementation is based on sample for drawing indices from 1 to n, the number of cases, with replacement. We store only the two parameters of interest for each fitted model. An alternative is to store the entire glm-object and subsequently extract what is needed.
B <- 999
n <- nrow(vegetables)
beta <- matrix(0, B, 2)
for (b in 1:B) {
    i <- sample(n, n, replace = TRUE)
    bootGlm <- glm(form, family = poisson, data = vegetables[i, ])
    beta[b, ] <- coefficients(bootGlm)[c("ad1", "discount")]
}
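The alternative mentioned above, storing the entire fitted objects, can be sketched as follows. The snippet uses simulated stand-in data (x and y are hypothetical) so that it is self-contained.

```r
# Sketch of the alternative: keep every bootstrap glm-object in a list and
# extract the coefficients afterwards. Simulated data stand in for vegetables.
set.seed(1)
n <- 50
d <- data.frame(x = runif(n))
d$y <- rpois(n, lambda = exp(1 + 2 * d$x))
B <- 20  # small B, for illustration only
fits <- vector("list", B)
for (b in 1:B) {
    i <- sample(n, n, replace = TRUE)
    fits[[b]] <- glm(y ~ x, family = poisson, data = d[i, ])
}
# Extract the slope from each stored fit.
betaBoot <- sapply(fits, function(fit) coefficients(fit)["x"])
```

This keeps everything (residuals, fitted values, and so on) available for later inspection, at the cost of considerably more memory than storing only the parameters of interest.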
An alternative to the non-parametric bootstrapping is to sample conditionally on the predictors from the fitted Poisson model. This is almost as easy as non-parametric bootstrapping; the only change is that the rpois function is used to simulate responses from the fitted model.
parbeta <- matrix(0, B, 2)
vegetablesSamp <- vegetables
muhat <- predict(vegetablesGlm, type = "response")
for (b in 1:B) {
    vegetablesSamp$sale <- rpois(n, muhat)
    bootGlm <- glm(form, family = poisson, data = vegetablesSamp)
    parbeta[b, ] <- coefficients(bootGlm)[c("ad1", "discount")]
}
We will compute two bootstrap confidence intervals below, but first we take a look at the standard deviations estimated from the bootstrap samples. These are estimates of the standard errors, and can be used for computing standard confidence intervals.
sebeta <- apply(beta, 2, sd)
separbeta <- apply(parbeta, 2, sd)
## Standard errors based on non-parametric bootstrapping
sebeta
## [1] 0.15800 0.01076
## Standard errors based on parametric bootstrapping
separbeta
## [1] 0.023889 0.002298
## Standard errors based on ordinary analytic approximations
coef(summary(vegetablesGlm))[c("ad1", "discount"), 2]
## ad1 discount
## 0.023956 0.002298
We observe that the standard errors based on parametric bootstrapping are almost identical to those obtained from the analytic approximation. Those based on non-parametric bootstrapping are, however, 5-7 times larger.
We can compute confidence intervals based on the bootstrapped estimates of the standard errors. These intervals are in this case much wider than those based on the Poisson model. The explanation is the overdispersion and the poor fit of the variance model that we observed for the Poisson model.
betahat <- coefficients(vegetablesGlm)[c("ad1", "discount")]
betahat + 1.96 * sebeta %o% c(-1, 1)
## [,1] [,2]
## [1,] 0.15957 0.77891
## [2,] 0.02297 0.06514
An alternative is to use another combinant and the quantiles from its bootstrapped distribution. Using the simple combinant \( \hat{\beta} - \beta \) results in the following confidence intervals.
betastar <- scale(beta, betahat, scale = FALSE)
qbeta <- t(apply(betastar, 2, quantile, probs = c(0.975, 0.025)))
betahat - qbeta
## 97.5% 2.5%
## [1,] 0.20540 0.81007
## [2,] 0.02612 0.06687
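Note that, since scale subtracts betahat, the interval above is algebraically the "basic" bootstrap interval \( 2\hat{\beta} - q \), where \( q \) is the upper or lower quantile of the bootstrapped estimates themselves. A small self-contained check on simulated numbers (not the vegetables fit):

```r
# Check on simulated numbers: the combinant-based interval, betahat minus the
# quantiles of (beta - betahat), equals 2 * betahat minus the quantiles of beta.
set.seed(1)
betahat <- 0.5
beta <- rnorm(999, mean = 0.5, sd = 0.1)  # stand-in bootstrap estimates
betastar <- beta - betahat
ci1 <- betahat - quantile(betastar, probs = c(0.975, 0.025))
ci2 <- 2 * betahat - quantile(beta, probs = c(0.975, 0.025))
# Quantiles are equivariant under a location shift, so the two intervals agree.
```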
We observe that these confidence intervals are shifted slightly to the right. The density estimate of the bootstrapped distribution shows a left skewness, which is what is reflected as a translation to the right of the confidence intervals.
library(ggplot2)
library(reshape2)
mbeta <- data.frame(beta)
names(mbeta) <- c("ad1", "discount")
mbeta <- melt(mbeta)
qplot(x = value, data = mbeta, fill = I("gray"), geom = "density") +
    facet_wrap(~ variable, scales = "free", ncol = 1)
The choice of \( B = 999 \) above is the recommended choice if we want to compute the quantile-based confidence intervals. A choice of \( B = 200 \) is sufficient if we just want to estimate standard errors.
For comparison we compute confidence intervals based on the quasi Poisson model and the \( \Gamma \)-model.
vegetablesGlm2 <- glm(form, family = quasipoisson, data = vegetables)
confint.default(vegetablesGlm2, c("ad1", "discount"))
## 2.5 % 97.5 %
## ad1 0.30806 0.63043
## discount 0.02859 0.05952
The use of the quasi Poisson model clearly improves the confidence intervals, but they are still not as wide as those obtained from non-parametric bootstrapping. The quasi Poisson model is only capable of correcting for overdispersion, and does not correct for a lack of fit of the variance function.
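What the quasi Poisson correction amounts to can be made concrete: the standard errors are the Poisson ones multiplied by the square root of the estimated dispersion, the Pearson statistic divided by the residual degrees of freedom. A sketch on simulated overdispersed counts (the variables x and y are hypothetical, not from the vegetables data):

```r
# Sketch on simulated overdispersed counts: quasi-Poisson standard errors equal
# the Poisson ones scaled by sqrt(phihat), the Pearson-based dispersion estimate.
set.seed(1)
x <- runif(200)
y <- rnbinom(200, mu = exp(1 + x), size = 2)  # overdispersed relative to Poisson
fitP <- glm(y ~ x, family = poisson)
fitQ <- glm(y ~ x, family = quasipoisson)
phihat <- sum(residuals(fitP, type = "pearson")^2) / fitP$df.residual
seP <- coef(summary(fitP))[, "Std. Error"]
seQ <- coef(summary(fitQ))[, "Std. Error"]
# seQ equals seP * sqrt(phihat); phihat exceeds 1 for these overdispersed data.
```

Because the correction is a single multiplicative factor, it can widen the intervals uniformly but cannot repair a misspecified variance function.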
vegetablesGlm3 <- glm(form, family = Gamma("log"), data = vegetables)
confint.default(vegetablesGlm3, c("ad1", "discount"))
## 2.5 % 97.5 %
## ad1 0.004606 0.36234
## discount 0.010511 0.03825
Though the \( \Gamma \)-model with the log-link has the same mean value structure as the Poisson model, the fit and the confidence intervals are quite different. The fit will be different because the weights obtained from the variance function are different. Cases with a large expected value will be weighted further down when we use the \( \Gamma \)-model, and will thus contribute less to the fit. However, I also suspect that there is a lack of fit of the linear model of discount, and that this is the reason that the change of model causes a dramatic change in the fitted model and the confidence intervals. This has to be investigated further.
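The down-weighting of cases with large expected values can be made explicit from the IWLS working weights. With the log link, \( \mathrm{d}\mu/\mathrm{d}\eta = \mu \), so
\[ w = \frac{(\mathrm{d}\mu/\mathrm{d}\eta)^2}{V(\mu)} = \begin{cases} \mu & \text{for the Poisson model, } V(\mu) = \mu, \\ 1 & \text{for the } \Gamma\text{-model, } V(\mu) = \mu^2. \end{cases} \]
Relative to the \( \Gamma \)-model, the Poisson fit thus weights cases in proportion to their expected value, which explains why the two fits can differ substantially when the mean value structure is not correct for all cases.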