Confidence intervals and generalized linear models

We use the vegetables data set to illustrate computation of confidence intervals and the use of bootstrapping. First, the data set is loaded and cleaned as in the lecture notes.

vegetables <- read.table("http://www.math.ku.dk/~richard/regression/data/vegetablesSale.txt", 
    header = TRUE, colClasses = c("numeric", "numeric", "factor", "factor", 
        "numeric", "numeric", "numeric"))
vegetables <- subset(vegetables, week != 2)
naid <- is.na(vegetables$discount)
impute <- with(subset(vegetables, !is.na(discount) & week == 4), c(1, median(discount), 
    median(discountSEK)))
vegetables[naid, c("ad", "discount", "discountSEK")] <- impute

vegetables <- within(vegetables, {
    meanNormSale <- sort(tapply(normalSale, store, mean))
    store <- factor(store, levels = names(meanNormSale))
    meanNormSale <- meanNormSale[store]
})

Then we fit the Poisson model from the notes and compute confidence intervals for the two regression parameters associated with the variables ad and discount. We first compute standard confidence intervals based on estimates of the standard errors, using the confint.default function. We then compute likelihood intervals based on the profile likelihood function by calling the generic confint function, which for glm objects dispatches to a method (from the MASS package) that computes profile intervals.

form <- sale ~ offset(log(normalSale)) + store + ad + discount - 1
vegetablesGlm <- glm(form, family = poisson, data = vegetables)
confint.default(vegetablesGlm, c("ad1", "discount"))
##            2.5 %  97.5 %
## ad1      0.42229 0.51620
## discount 0.03955 0.04856
confint(vegetablesGlm, c("ad1", "discount"))
## Waiting for profiling to be done...
##            2.5 %  97.5 %
## ad1      0.42230 0.51621
## discount 0.03956 0.04857

The standard intervals and the likelihood intervals are almost identical in this case.
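To see what confint.default computes, here is a minimal, self-contained sketch on simulated data (not the vegetables data): the standard Wald interval is the estimate plus/minus a standard normal quantile times the estimated standard error.

```r
## Wald interval computed by hand on simulated Poisson data; qnorm gives
## the standard normal quantiles that confint.default also uses.
set.seed(1)
x <- runif(100)
y <- rpois(100, exp(1 + 0.5 * x))
fit <- glm(y ~ x, family = poisson)
est <- unname(coef(fit)["x"])
se <- unname(sqrt(diag(vcov(fit)))["x"])
c(lower = est + qnorm(0.025) * se, upper = est + qnorm(0.975) * se)
confint.default(fit, "x")  # the same interval
```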

Bootstrapping

We then turn to non-parametric bootstrapping using pair sampling. The implementation is based on sample for drawing indices from 1 to n, the number of cases, with replacement. We store only the two parameters of interest for each fitted model; an alternative is to store the entire glm object and subsequently extract what is needed.

B <- 999
n <- nrow(vegetables)
beta <- matrix(0, B, 2)
for (b in 1:B) {
    i <- sample(n, n, replace = TRUE)
    bootGlm <- glm(form, family = poisson, data = vegetables[i, ])
    beta[b, ] <- coefficients(bootGlm)[c("ad1", "discount")]
}
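The same pair bootstrap can alternatively be run with the boot package (assuming it is installed), which also provides several bootstrap confidence intervals through boot.ci. The function name pairStat is just an illustrative choice.

```r
library(boot)
## Statistic for boot(): refit the model on the resampled rows given by the
## index vector i and return the two coefficients of interest.
pairStat <- function(data, i) {
    bootGlm <- glm(form, family = poisson, data = data[i, ])
    coefficients(bootGlm)[c("ad1", "discount")]
}
vegBoot <- boot(vegetables, pairStat, R = 999)
boot.ci(vegBoot, type = "perc", index = 2)  # percentile interval for discount
```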

An alternative to non-parametric bootstrapping is to sample conditionally on the predictors from the fitted Poisson model. This is almost as easy as the non-parametric bootstrap; the rpois function simulates new responses from the fitted model.

parbeta <- matrix(0, B, 2)
vegetablesSamp <- vegetables
muhat <- predict(vegetablesGlm, type = "response")
for (b in 1:B) {
    vegetablesSamp$sale <- rpois(n, muhat)
    bootGlm <- glm(form, family = poisson, data = vegetablesSamp)
    parbeta[b, ] <- coefficients(bootGlm)[c("ad1", "discount")]
}

We will compute two bootstrap confidence intervals below, but first we take a look at the standard deviations estimated from the bootstrap samples. These are estimates of the standard errors, and can be used for computing standard confidence intervals.

sebeta <- apply(beta, 2, sd)
separbeta <- apply(parbeta, 2, sd)
## Standard errors based on non-parametric bootstrapping
sebeta
## [1] 0.15800 0.01076
## Standard errors based on parametric bootstrapping
separbeta
## [1] 0.023889 0.002298
## Standard errors based on ordinary analytic approximations
coef(summary(vegetablesGlm))[c("ad1", "discount"), 2]
##      ad1 discount 
## 0.023956 0.002298

We observe that the standard errors based on parametric bootstrapping are almost identical to those obtained from the analytic approximation. Those based on non-parametric bootstrapping are, however, between four and seven times larger.
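The discrepancy can be quantified directly as the ratio of the non-parametric bootstrap standard errors to the analytic ones, assuming sebeta and vegetablesGlm from above are in the workspace.

```r
## Ratio of non-parametric bootstrap to analytic standard errors; values
## well above 1 indicate variability the Poisson model does not capture.
sebeta / coef(summary(vegetablesGlm))[c("ad1", "discount"), 2]
```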

We can compute confidence intervals based on the bootstrapped estimates of standard error. These intervals are in this case much wider than those based on the Poisson model. The explanation is the overdispersion and the poor fit of the variance function that we observed for the Poisson model.

betahat <- coefficients(vegetablesGlm)[c("ad1", "discount")]
betahat + 1.96 * sebeta %o% c(-1, 1)
##         [,1]    [,2]
## [1,] 0.15957 0.77891
## [2,] 0.02297 0.06514

An alternative is to use another combinant and the quantiles of its bootstrapped distribution. Using the simple combinant \( \hat{\beta} - \beta \) results in the following confidence intervals.

betastar <- scale(beta, betahat, scale = FALSE)
qbeta <- t(apply(betastar, 2, quantile, probs = c(0.975, 0.025)))
betahat - qbeta
##        97.5%    2.5%
## [1,] 0.20540 0.81007
## [2,] 0.02612 0.06687

We observe that these confidence intervals are shifted slightly to the right. A density estimate of the bootstrapped distribution shows a left skewness, and this skewness is what shifts the confidence intervals to the right.

library(ggplot2)
library(reshape2)
mbeta <- data.frame(beta)
names(mbeta) <- c("ad1", "discount")
mbeta <- melt(mbeta)
qplot(x = value, data = mbeta, fill = I("gray"), geom = "density") + facet_wrap(~variable, 
    scales = "free", ncol = 1)

(Figure: density estimates of the bootstrapped distributions of the ad1 and discount coefficients.)

The choice of \( B = 999 \) above is recommended when we want to compute the quantile-based confidence intervals, whereas \( B = 200 \) is sufficient if we just want to estimate standard errors.

The quasi Poisson and the \( \Gamma \)-model

For comparison we compute confidence intervals based on the quasi Poisson model and the \( \Gamma \)-model.

vegetablesGlm2 <- glm(form, family = quasipoisson, data = vegetables)
confint.default(vegetablesGlm2, c("ad1", "discount"))
##            2.5 %  97.5 %
## ad1      0.30806 0.63043
## discount 0.02859 0.05952

The use of the quasi Poisson model clearly improves the confidence intervals, but they are still not as wide as those obtained from non-parametric bootstrapping. The quasi Poisson model only corrects for overdispersion; it does not correct for a lack of fit of the variance function.
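The quasi Poisson fit gives the same point estimates as the Poisson fit; only the standard errors change, being scaled by the square root of the estimated dispersion parameter. A sketch, assuming the two fits above are available:

```r
## Estimated dispersion parameter (it would be close to 1 for a
## well-fitting Poisson model).
phi <- summary(vegetablesGlm2)$dispersion
phi
## The quasi Poisson standard errors equal the Poisson ones times sqrt(phi).
sqrt(phi) * coef(summary(vegetablesGlm))[c("ad1", "discount"), 2]
coef(summary(vegetablesGlm2))[c("ad1", "discount"), 2]
```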

vegetablesGlm3 <- glm(form, family = Gamma("log"), data = vegetables)
confint.default(vegetablesGlm3, c("ad1", "discount"))
##             2.5 %  97.5 %
## ad1      0.004606 0.36234
## discount 0.010511 0.03825

Though the \( \Gamma \)-model with the log-link has the same mean value structure as the Poisson model, the fit and the confidence intervals are quite different. The fit differs because the weights obtained from the variance function differ: cases with a large expected value are weighted down when we use the \( \Gamma \)-model, and thus contribute less to the fit. However, I also suspect a lack of fit of the linear term in discount, and that this is why the change of model causes such a dramatic change in the fitted model and the confidence intervals. This has to be investigated further.
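The suspected lack of fit in the discount effect could be probed by plotting deviance residuals against discount with a smoother added; a systematic trend would indicate that a linear term is insufficient. A sketch using ggplot2 (already loaded above):

```r
## Deviance residuals from the Gamma fit against discount; a curved smooth
## would suggest that the linear discount term is inadequate.
ggplot(data.frame(discount = vegetables$discount,
                  residual = residuals(vegetablesGlm3, type = "deviance")),
       aes(discount, residual)) +
    geom_point() +
    geom_smooth()
```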