class: center, middle, inverse, title-slide

# Poisson regression and numerical optimization

### Niels Richard Hansen

### September 24, 2020

---

## Poisson regression

Consider observations `\(y_i \in \mathbb{N}_0\)`, `\(x_i \in \mathbb{R}^p\)` for `\(i = 1, \ldots, n\)`, and Poisson point probabilities

`$$f_i(y_i) = e^{-\mu(x_i)} \frac{\mu(x_i)^{y_i}}{y_i!}.$$`

If `\(\log(\mu(x_i)) = x_i^T \beta\)` we can rewrite the point probabilities as

`$$f_i(y_i) = e^{y_i x_i^T \beta - \exp(x_i^T \beta)} \frac{1}{y_i!}.$$`

This is a *Poisson regression model*.

---

## Example

```r
summary(vegetables)
```

```
      sale          normalSale          store     
 Min.   :  1.00   Min.   :  0.20   1      :   4  
 1st Qu.: 12.00   1st Qu.:  4.20   102    :   4  
 Median : 21.00   Median :  7.25   106    :   4  
 Mean   : 40.29   Mean   : 11.72   107    :   4  
 3rd Qu.: 40.00   3rd Qu.: 12.25   11     :   4  
 Max.   :571.00   Max.   :102.00   110    :   4  
                                   (Other):1042  
```

```r
dim(vegetables)
```

```
[1] 1066    3
```

---

## Poisson model

`$$\log(E(\text{sale})) = \beta_0 + \beta_1 \log(\text{normalSale}) + \beta_{\text{store}}.$$`

```r
## Note, the variable store is a factor with 352 levels!
pois_model <- glm(
  sale ~ log(normalSale) + store,
  data = vegetables,
  family = poisson()
)
```

```r
summary(pois_model) %>% coefficients() %>% head()
```

```
                Estimate Std. Error z value  Pr(>|z|)
(Intercept)        2.718      0.128    21.3  9.4e-101
log(normalSale)    0.202      0.031     6.5   9.5e-11
store10            1.577      0.114    13.9   1.1e-43
store100           0.768      0.125     6.2   7.0e-10
store101           0.583      0.131     4.5   8.2e-06
store102          -0.029      0.149    -0.2   8.4e-01
```

---

## Exponential families

Joint density

`\begin{equation} f(\mathbf{y} \mid \theta) = \prod_{i=1}^n \frac{1}{\varphi_i(\theta)} e^{\theta^T t_i(y_i)} = e^{\theta^T \sum_{i=1}^n t_i(y_i) - \sum_{i=1}^n \kappa_i(\theta)} \end{equation}`

where `\(\varphi_i(\theta) = \int e^{\theta^T t_i(u)} \mu_i(\mathrm{d}u)\)` and `\(\kappa_i(\theta) = \log(\varphi_i(\theta)).\)`

The log-likelihood is

`$$\ell(\theta) = \theta^T t(\mathbf{y}) - \kappa(\theta)$$`

where

`$$t(\mathbf{y}) = \sum_{i=1}^n t_i(y_i) \quad \text{and} \quad \kappa(\theta) = \sum_{i=1}^n \log \varphi_i(\theta).$$`

The gradient is

`$$\nabla \ell(\theta) = t(\mathbf{y}) - \nabla \kappa(\theta).$$`

---

## The Poisson regression model

For the Poisson regression model we find that

`$$t(\mathbf{y}) = \sum_{i=1}^n y_i x_i = \mathbf{X}^T \mathbf{y} \quad \text{and} \quad \kappa(\beta) = \sum_{i=1}^n e^{x_i^T \beta}.$$`

Moreover,

`$$\nabla \kappa(\beta) = \sum_{i=1}^n x_i e^{x_i^T \beta} = \mathbf{X}^T \mathbf{w}(\beta)$$`

where

`$$\mathbf{w}(\beta) = \exp(\mathbf{X} \beta)$$`

with `\(\exp\)` applied coordinatewise.

---

## Model matrix and `glm.fit()`

```r
X <- model.matrix(
  sale ~ log(normalSale) + store,
  data = vegetables
)
y <- vegetables$sale
```

```r
system.time(
  pois_fit <- glm.fit(X, y, family = poisson())
)
```

```
   user  system elapsed 
  0.540   0.008   0.550 
```

---

## `glm.fit()`

```r
pois_fit$iter
```

```
[1] 5
```

```r
pois_fit$converged
```

```
[1] TRUE
```

The objective function is not the negative log-likelihood itself but twice the negative log-likelihood plus an additive constant: the *deviance*. For comparisons it is therefore easiest to compute deviance differences, in which the constant cancels.
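To make the relation concrete, the deviance can be reconstructed from Poisson log-likelihoods, since the saturated model fits `\(\mu_i = y_i\)`. A minimal sketch, assuming the `y` and `pois_fit` objects from above (`loglik` and `dev_manual` are just illustrative names):

```r
## Sketch: the deviance is twice the log-likelihood of the saturated
## model (mu_i = y_i) minus twice that of the fitted model; the
## constant sum(log(y_i!)) cancels in this difference.
loglik <- function(mu, y) sum(dpois(y, lambda = mu, log = TRUE))
dev_manual <- 2 * (loglik(y, y) - loglik(pois_fit$fitted.values, y))
all.equal(dev_manual, pois_fit$deviance)  # expected: TRUE
```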
```r
pois_fit0 <- glm.fit(X[, 1], y, family = poisson())
pois_fit0$deviance - pois_fit$deviance
```

```
[1] 42592.68
```

---

## `glm.fit()`

```r
pois_fit <- glm.fit(X, y, family = poisson(), control = list(trace = TRUE))
```

```
Deviance = 9410.28 Iterations - 1
Deviance = 8595.537 Iterations - 2
Deviance = 8584.647 Iterations - 3
Deviance = 8584.638 Iterations - 4
Deviance = 8584.638 Iterations - 5
```

The stopping criterion is *small relative descent* in terms of deviance, that is,

`$$|\mathrm{deviance}_{n-1} - \mathrm{deviance}_n| < \varepsilon (\mathrm{deviance}_n + 0.1)$$`

with the default tolerance parameter `\(\varepsilon = 10^{-8}\)` and a maximal number of iterations set to 25.

---

## Implementation of objective function

```r
t_map <- drop(crossprod(X, y))
H <- function(beta)
  drop(sum(exp(X %*% beta)) - beta %*% t_map) / nrow(X)
grad_H <- function(beta)
  (drop(crossprod(X, exp(X %*% beta))) - t_map) / nrow(X)
```

Recomputing the deviance difference

```r
2 * nrow(X) * (H(c(pois_fit0$coefficients, rep(0, 352))) -
                 H(pois_fit$coefficients))
```

```
[1] 42592.68
```

and the value of the objective function (the negative log-likelihood per observation, up to an additive constant)

```r
H(pois_fit$coefficients)
```

```
[1] -128.5895
```

---

## Using `optim()` with conjugate gradient

```r
system.time(
  pois_CG <- optim(
    rep(0, ncol(X)),
    H,
    grad_H,
    method = "CG",
    control = list(maxit = 10000)
  )
)
```

```
   user  system elapsed 
 10.989   0.068  11.090 
```

```r
pois_CG$value
```

```
[1] -128.5895
```

```r
pois_CG$counts
```

```
function gradient 
   10858     4641 
```

---

## Using `optim()` with BFGS

```r
system.time(
  pois_BFGS <- optim(
    rep(0, ncol(X)),
    H,
    grad_H,
    method = "BFGS",
    control = list(maxit = 10000)
  )
)
```

```
   user  system elapsed 
  0.262   0.001   0.264 
```

```r
pois_BFGS$value
```

```
[1] -128.5894
```

```r
pois_BFGS$counts
```

```
function gradient 
     126      114 
```

---

## Using sparse matrices

```r
library(Matrix)
```

```r
X <- Matrix(X)
```

```r
system.time(
  pois_BFGS_sparse <- optim(
    rep(0, ncol(X)),
    H,
    grad_H,
    method = "BFGS",
    control = list(maxit = 10000)
  )
)
```

```
   user  system elapsed 
  0.096   0.003   0.099 
```

```r
range(pois_BFGS$par - pois_BFGS_sparse$par)
```

```
[1] 0 0
```
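Converting the dense model matrix with `Matrix()` works, but with a 352-level factor the dense matrix need never be formed at all. A minimal sketch using `Matrix::sparse.model.matrix()` to build the sparse representation directly, assuming the `vegetables` data from above (`X_sparse` is just an illustrative name):

```r
## Sketch: construct the sparse model matrix directly; the store
## dummies make the matrix very sparse, so this also saves memory
## compared to densifying first and converting afterwards.
X_sparse <- sparse.model.matrix(
  sale ~ log(normalSale) + store,
  data = vegetables
)
## Fraction of structurally nonzero entries -- expected to be tiny
nnzero(X_sparse) / prod(dim(X_sparse))
```

Both `sparse.model.matrix()` and `nnzero()` come with the `Matrix` package loaded above, so no further imports are needed.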