class: center, middle, inverse, title-slide

# Bivariate smoothing

### Niels Richard Hansen

### September 11, 2019

---

## A scatter plot

```r
p <- qplot(Year, Temperature, data = Nuuk_year); p
```

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-2-1.png" height="400" style="display: block; margin: auto;" />

---

## Nearest neighbor estimation

Data: `\((x_1, y_1), \ldots, (x_n, y_n)\)`

The `\(k\)` nearest neighbor smoother in `\(x\)` is defined as

`$$\hat{f}(x) = \frac{1}{k} \sum_{j \in N(x)} y_j$$`

where `\(N(x)\)` is the set of indices for the `\(k\)` nearest neighbors of `\(x\)`.

This is an estimator of

`$$f(x) = E(Y \mid X = x).$$`

---

## Nearest neighbor estimation

The estimator in `\(x_i\)` is

`$$\hat{f}_i = \frac{1}{k} \sum_{j \in N_i} y_j$$`

where `\(N_i = N(x_i)\)`.

With `\(S_{ij} = \frac{1}{k} 1(j \in N_i)\)` and `\(\mathbf{S} = (S_{ij})\)`,

`$$\hat{\mathbf{f}} = (\hat{f}_i) = \mathbf{S} \mathbf{y}.$$`

`\(\hat{\mathbf{f}}\)` is an estimator of the vector `\((f(x_i))\)`.

---

## Linear smoothers

Any estimator of `\((f(x_i))\)` of the form `\(\mathbf{S} \mathbf{y}\)` for a *smoother matrix* `\(\mathbf{S}\)` is called a *linear smoother*.

The `\(k\)` nearest neighbor smoother is a simple example of a linear smoother that works for `\(x\)`-values in any metric space.

The representation of a linear smoother as a matrix-vector product,

`$$\mathbf{S} \mathbf{y},$$`

is theoretically useful but often not the best way to actually compute `\(\hat{\mathbf{f}}\)`.

---

## Running mean

When `\(x_i \in \mathbb{R}\)` we can sort the data according to the `\(x\)`-values and then use a *symmetric* neighbor definition:

`$$N_i = \{i - (k - 1) / 2, i - (k - 1) / 2 + 1, \ldots, \\ i - 1 , i, i + 1, \ldots, i + (k - 1) / 2\}$$`

(for `\(k\)` odd). This simplifies computations: we don't need to keep track of metric comparisons, only the order matters.

---

## Bias-variance tradeoff

If data is i.i.d. with `\(V(Y \mid X) = \sigma^2\)`:

`\begin{aligned}
& \mathrm{MSE}(x) = E((f(x) - \hat{f}(x))^2 \mid X=x) \\
& = E((f(x)-E(\hat{f}(x)))^2 \mid X=x) + E((\hat{f}(x)-E(\hat{f}(x)))^2 \mid X=x) \\
& = \underbrace{\left[f(x) - \frac{1}{k} \sum_{l \in N(x)} f(x_l)\right]^2}_{\textrm{squared bias}} + \underbrace{\frac{\sigma^2}{k}}_{\textrm{variance}}
\end{aligned}`

Here `\(f(x) = E(Y \mid X = x)\)`.

Small `\(k\)` gives large variance and small bias (if `\(f\)` is smooth). Large `\(k\)` gives small variance and potentially large bias (if `\(f\)` is not constant).
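---

## The smoother matrix in R

To make the linear smoother representation concrete, here is a minimal sketch constructing `\(\mathbf{S}\)` for the `\(k\)` nearest neighbor smoother. The helper `smoother_matrix` is hypothetical (not from the notes), but dense matrices of this kind are presumably what the benchmarks below multiply with (`S1`, ..., `S4`).

```r
## Sketch: S[i, j] = 1/k if j is among the k nearest neighbors of x[i]
smoother_matrix <- function(x, k) {
  n <- length(x)
  S <- matrix(0, n, n)
  for (i in seq_len(n))
    S[i, order(abs(x - x[i]))[1:k]] <- 1 / k  ## Ties broken by order()
  S
}
f_hat_S <- smoother_matrix(Nuuk_year$Year, 11) %*% Nuuk_year$Temperature
```

Constructing and multiplying with `\(\mathbf{S}\)` costs `\(O(n^2)\)` time and memory, which is what the direct implementations on the following slides avoid.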
---

## Running mean

Implementation (assuming `y` is in the correct order) using the identity

`$$\hat{f}_{i+1} = \hat{f}_{i} + \frac{1}{k}\left(y_{i + (k + 1)/2} - y_{i - (k-1)/2}\right),$$`

```r
## The vector 'y' must be sorted according to the x-values
runMean <- function(y, k) {
  n <- length(y)
  m <- floor((k - 1) / 2)
  k <- 2 * m + 1           ## Ensures that k is odd
  y <- y / k               ## Scaling once up front makes the update a plain sum
  s <- rep(NA, n)
  s[m + 1] <- sum(y[1:k])  ## The first complete window
  for (i in (m + 1):(n - m - 1)) 
    s[i + 1] <- s[i] - y[i - m] + y[i + 1 + m]
  s
}
```

---

## Visualization

```r
f_hat <- runMean(Nuuk_year$Temperature, 11)
p + geom_line(aes(y = f_hat), color = "blue")
```

<img src="BivariateSmoothers_files/figure-html/Nuuk-NN-plot2-1.png" height="400" style="display: block; margin: auto;" />

---

## Using 'filter'

The R function `filter` computes moving averages, and thus in particular running means.

```r
f_hat_filter <- stats::filter(Nuuk_year$Temperature, rep(1/11, 11))
range(f_hat_filter - f_hat, na.rm = TRUE)
```

```
[1] -1.332268e-15  4.440892e-16
```

```r
f_hat_filter[c(1:10, 137:147)]
```

```
 [1]    NA    NA    NA    NA    NA -1.85 -1.72 -1.55 -1.52 -1.48 -0.55
[12] -0.52 -0.15 -0.23 -0.19 -0.12    NA    NA    NA    NA    NA
```

---

## Benchmarking

```
                                       expr     median
1                           S1 %*% y[1:512]   324.9540
2                          S2 %*% y[1:1024]  1327.9415
3                          S3 %*% y[1:2048]  5429.9425
4                          S4 %*% y[1:4196] 23142.1245
5                 runMean(y[1:512], k = 11)    80.4405
6                runMean(y[1:1024], k = 11)   160.9555
7                runMean(y[1:2048], k = 11)   302.6750
8                runMean(y[1:4196], k = 11)   604.9305
9    stats::filter(y[1:512], rep(1/11, 11))    91.1445
10  stats::filter(y[1:1024], rep(1/11, 11))   114.6415
11  stats::filter(y[1:2048], rep(1/11, 11))   172.9890
12  stats::filter(y[1:4196], rep(1/11, 11))   263.9965
```

The matrix-vector multiplication is `\(O(n^2)\)`. The two other algorithms are `\(O(n)\)`.

---

## LOOCV

The running mean / nearest neighbor smoother is a *linear smoother*, `\(\hat{\mathbf{f}} = \mathbf{S} \mathbf{y}.\)`

How do we predict `\(y_i\)` if `\((x_i, y_i)\)` is left out? For a linear smoother we can take

`$$\hat{f}^{-i}_i = \sum_{j \neq i} \frac{S_{ij}y_j}{1 - S_{ii}}$$`

as a *definition*. For many smoothing procedures with a natural "out-of-sample" prediction method, this definition agrees with actually leaving out `\((x_i, y_i)\)` and refitting.

---

## LOOCV

Since `\(\hat{f}_i = \sum_j S_{ij} y_j\)`, we get `\(y_i - \hat{f}^{-i}_i = (y_i - \hat{f}_i) / (1 - S_{ii})\)`, and it follows that for *leave-one-out cross-validation*

`$$\mathrm{LOOCV} = \sum_{i} (y_i - \hat{f}^{-i}_i)^2 = \sum_{i} \left(\frac{y_i - \hat{f}_i}{1 - S_{ii}}\right)^2$$`

```r
loocv <- function(k, y) {
  f_hat <- runMean(y, k)
  ## For the running mean S_ii = 1/k; the NAs at the boundary are dropped
  mean(((y - f_hat) / (1 - 1/k))^2, na.rm = TRUE)
}
```

---

## LOOCV

```r
k <- seq(3, 40, 2)
CV <- sapply(k, function(kk) loocv(kk, Nuuk_year$Temperature))
k_opt <- k[which.min(CV)]
qplot(k, CV) + geom_line() + geom_vline(xintercept = k_opt, color = "red")
```

<img src="BivariateSmoothers_files/figure-html/Nuuk-running-loocv-1.png" width="500" height="333" style="display: block; margin: auto;" />

---

## LOOCV

The optimal choice of `\(k\)` is 15 (blue); the smoother with `\(k = 25\)` (red) is shown for comparison.

```r
p + geom_line(aes(y = runMean(Nuuk_year$Temperature, 25)), color = "red") +
  geom_line(aes(y = runMean(Nuuk_year$Temperature, k_opt)), color = "blue")
```

<img src="BivariateSmoothers_files/figure-html/Nuuk-NN-plot3-1.png" height="400" style="display: block; margin: auto;" />

---

## Smoothing splines

The minimizer of

`$$L(f) = \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \underbrace{\int f''(z)^2 \mathrm{d} z}_{\|f''\|_2^2}$$`

is a cubic spline with *knots* in the data points `\(x_i\)`, that is, a function

`$$f = \sum_i \beta_i \varphi_i$$`

where the `\(\varphi_i\)` are basis functions for the `\(n\)`-dimensional space of such splines. A cubic spline is a piecewise degree 3 polynomial between its knots.

For `\(\lambda \to 0\)` the minimizer interpolates the data, while for `\(\lambda \to \infty\)` it approaches the least squares straight line, for which `\(f'' = 0\)`.
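---

## Computing the penalty matrix

The matrix `\(\mathbf{\Omega}\)` with entries `\(\mathbf{\Omega}_{ij} = \int \varphi_i''(z) \varphi_j''(z) \mathrm{d}z\)`, formalized on the next slide, can be computed exactly: second derivatives of cubic B-splines are piecewise linear, so the integrand is piecewise quadratic, and Simpson's rule on each inter-knot interval is exact. Below is a sketch along the lines of the implementation in the notes, using `splineDesign` from the splines package (introduced on a later slide); sorted, distinct `inner_knots` are an assumption.

```r
pen_mat <- function(inner_knots) {
  knots <- sort(c(rep(range(inner_knots), 3), inner_knots))
  d <- diff(inner_knots)  ## Interval lengths
  ## Second derivatives of the B-spline basis in the knots and the midpoints
  g_ab <- splineDesign(knots, inner_knots, derivs = 2)
  g_ab_mid <- splineDesign(knots, inner_knots[-length(inner_knots)] + d / 2,
                           derivs = 2)
  g_a <- g_ab[-nrow(g_ab), ]
  g_b <- g_ab[-1, ]
  ## Simpson's rule on each interval, summed via weighted crossproducts
  (crossprod(d * g_a, g_a) +
     4 * crossprod(d * g_ab_mid, g_ab_mid) +
     crossprod(d * g_b, g_b)) / 6
}
```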
---

## Smoothing splines

In vector notation

`$$\hat{\mathbf{f}} = \boldsymbol{\Phi}\hat{\beta}$$`

with `\(\boldsymbol{\Phi}_{ij} = \varphi_j(x_i)\)`, and

`\begin{aligned}
L(\mathbf{f}) & = (\mathbf{y} - \mathbf{f})^T (\mathbf{y} - \mathbf{f}) + \lambda \|f''\|_2^2 \\
& = ( \mathbf{y} - \boldsymbol{\Phi}\beta)^T (\mathbf{y} - \boldsymbol{\Phi}\beta) + \lambda \beta^T \mathbf{\Omega} \beta
\end{aligned}`

with

`$$\mathbf{\Omega}_{ij} = \int \varphi_i''(z) \varphi_j''(z) \mathrm{d}z.$$`

---

## Smoothing splines

The minimizer is

`$$\hat{\beta} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{\Omega})^{-1}\boldsymbol{\Phi}^T \mathbf{y}$$`

with resulting smoother

`$$\hat{\mathbf{f}} = \underbrace{\boldsymbol{\Phi} (\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{\Omega})^{-1}\boldsymbol{\Phi}^T}_{\mathbf{S}_{\lambda}} \mathbf{y}.$$`

---

## Splines in R

```r
library(splines)
## Note the specification of repeated boundary knots
knots <- c(0, 0, 0, seq(0, 1, 0.2), 1, 1, 1)
xx <- seq(0, 1, 0.005)
B_splines <- splineDesign(knots, xx)
matplot(xx, B_splines, type = "l", lty = 1)
```

<img src="BivariateSmoothers_files/figure-html/splines-1.png" height="333" style="display: block; margin: auto;" />

---

## Penalty matrix

```r
## See the notes for the implementation of pen_mat (a sketch was given above)
Omega <- pen_mat(seq(0, 1, 0.1))
image(Matrix(Omega))  ## Matrix() from the Matrix package
```

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-4-1.png" height="400" style="display: block; margin: auto;" />

---

## Fitting a smoothing spline

We implement the matrix algebra directly for computing `\(\mathbf{S}_{\lambda} \mathbf{y}\)`.

```r
inner_knots <- Nuuk_year$Year
knots <- c(rep(range(inner_knots), 3), inner_knots)
Phi <- splineDesign(knots, inner_knots)
Omega <- pen_mat(inner_knots)
smoother <- function(lambda) 
  Phi %*% solve(
    crossprod(Phi) + lambda * Omega, 
    t(Phi) %*% Nuuk_year$Temperature
  )
```

---

## Fitting a smoothing spline

```r
p + geom_line(aes(y = smoother(0.1)), color = "blue") +   ## Undersmooth
  geom_line(aes(y = smoother(10)), color = "red") +       ## Smooth
  geom_line(aes(y = smoother(1000)), color = "purple")    ## Oversmooth
```

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-5-1.png" height="400" style="display: block; margin: auto;" />

---

## Generalized cross-validation

With `\(\mathrm{df} = \mathrm{trace}(\mathbf{S}) = \sum_{i=1}^n S_{ii}\)` we replace `\(S_{ii}\)` in LOOCV by `\(\mathrm{df} / n\)` to get the *generalized* cross-validation criterion

`$$\mathrm{GCV} = \sum_{i=1}^n \left(\frac{y_i - \hat{f}_i}{1 - \mathrm{df} / n}\right)^2.$$`

```r
gcv <- function(lambda, y) {
  S <- Phi %*% solve(crossprod(Phi) + lambda * Omega, t(Phi))
  df <- sum(diag(S))  ## The trace of the smoother matrix
  sum(((y - S %*% y) / (1 - df / length(y)))^2, na.rm = TRUE)
}
```
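---

## Generalized cross-validation

Instead of the grid search on the next slide, `gcv` can also be handed to a one-dimensional minimizer such as `optimize`. A minimal sketch; the search interval is an assumption, and GCV need not be unimodal, so a grid or a plot remains a useful sanity check.

```r
lambda_gcv <- optimize(gcv, interval = c(1, 1000), y = Nuuk_year$Temperature)
lambda_gcv$minimum  ## The GCV-optimal lambda within the search interval
```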
---

## Generalized cross-validation

Here we apply `gcv` to a grid of `\(\lambda\)`-values and choose the value of `\(\lambda\)` that minimizes GCV.

```r
lambda <- seq(50, 250, 2)
GCV <- sapply(lambda, gcv, y = Nuuk_year$Temperature)
lambda_opt <- lambda[which.min(GCV)]
```

---

## Generalized cross-validation

```r
qplot(lambda, GCV) + geom_vline(xintercept = lambda_opt, color = "red")
```

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-6-1.png" height="400" style="display: block; margin: auto;" />

---

## Optimal smoothing spline

```r
temp_smooth_opt <- Phi %*% solve(
  crossprod(Phi) + lambda_opt * Omega, 
  t(Phi) %*% Nuuk_year$Temperature
)
p + geom_line(aes(y = temp_smooth_opt), color = "blue")
```

<img src="BivariateSmoothers_files/figure-html/Nuuk-spline-opt-1.png" height="400" style="display: block; margin: auto;" />

---

## Using `smooth.spline`

```r
temp_smooth_splines <- smooth.spline(
  Nuuk_year$Year, 
  Nuuk_year$Temperature, 
  all.knots = TRUE  ## Use all x-values as knots; don't use the fast heuristic
)
p + geom_line(aes(y = temp_smooth_splines$y), color = "blue")
```

<img src="BivariateSmoothers_files/figure-html/Nuuk-spline-opt2-1.png" height="400" style="display: block; margin: auto;" />

---

## Efficient computations

In practice we use `\(p < n\)` basis functions. Using the singular value decomposition

`$$\Phi = \mathbf{U} D \mathbf{V}^T$$`

it holds that

`$$\mathbf{S}_{\lambda} = \widetilde{\mathbf{U}} (I + \lambda \Gamma)^{-1} \widetilde{\mathbf{U}}^T$$`

where `\(\widetilde{\mathbf{U}} = \mathbf{U} \mathbf{W}\)` and

`$$D^{-1} \mathbf{V}^T \mathbf{\Omega} \mathbf{V} D^{-1} = \mathbf{W} \Gamma \mathbf{W}^T.$$`

---

## Efficient computations

The interpretation of this representation is as follows.

* First, the coefficients, `\(\hat{\beta} = \widetilde{\mathbf{U}}^T \mathbf{y}\)`, are computed for expanding `\(\mathbf{y}\)` in the orthonormal basis given by the columns of `\(\widetilde{\mathbf{U}}\)`.

* Second, the `\(i\)`th coefficient is shrunk towards 0,

  `$$\hat{\beta}_i(\lambda) = \frac{\hat{\beta}_i}{1 + \lambda \gamma_i}.$$`

* Third, the smoothed values, `\(\widetilde{\mathbf{U}} \hat{\beta}(\lambda)\)`, are computed as an expansion using the shrunken coefficients.

---

### The Demmler-Reinsch basis (columns of `\(\widetilde{\mathbf{U}}\)`)

<img src="BivariateSmoothers_files/figure-html/spline-diagonalization-1.png" height="500" style="display: block; margin: auto;" />

---

## and the eigenvalues `\(\gamma_i\)`

<img src="BivariateSmoothers_files/figure-html/spline-gamma-1.png" height="400" style="display: block; margin: auto;" />
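---

### Computing the Demmler-Reinsch basis

The basis and the eigenvalues shown in the two preceding figures can be computed directly from the singular value decomposition. A minimal sketch; the thinning of the knots to get `\(p < n\)` is an assumption, and `inner_knots` and `pen_mat` are as defined earlier.

```r
## Thin the knots to p < n basis functions so that Phi has full column rank
inner_thin <- inner_knots[seq(1, length(inner_knots), 2)]
Phi_p <- splineDesign(c(rep(range(inner_thin), 3), inner_thin), inner_knots)
Omega_p <- pen_mat(inner_thin)
Phi_svd <- svd(Phi_p)
## D^{-1} V^T Omega V D^{-1} and its eigendecomposition W Gamma W^T
Omega_tilde <- crossprod(Phi_svd$v, Omega_p %*% Phi_svd$v) /
  outer(Phi_svd$d, Phi_svd$d)
Omega_tilde_eig <- eigen(Omega_tilde, symmetric = TRUE)
gamma <- Omega_tilde_eig$values                   ## The eigenvalues gamma_i
U_tilde <- Phi_svd$u %*% Omega_tilde_eig$vectors  ## The Demmler-Reinsch basis
```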
---

## Nadaraya-Watson

The `ksmooth` function implements the Nadaraya-Watson smoother. There is no automatic choice of bandwidth.

Kernel smoothing for bivariate smoothing is easy to implement along the same lines as kernel smoothing for density estimation. Binning can be useful for handling large data sets.

Local regression smoothing can be implemented using `lm` and its `weights` argument, or perhaps more efficiently using `lm.wfit`. Refits are in principle needed at all points.

The `locpoly` function in the KernSmooth package does local polynomial regression using binning. It does not do automatic bandwidth selection.

For kernel smoothing it is quite easy to compute the diagonals in the smoother matrix and thus LOOCV or GCV.

---

## Loess

The nonlinear loess estimator is implemented in the `loess` function. It could be implemented using `lm.wfit`.

The `loess` function does not automatically select the span but has a default of 0.75, which is often too large.

Since loess is nonlinear, the formulas for linear smoothers do not apply. One can instead implement 5- or 10-fold cross-validation.

The `loess` function does return an object with a `trace.hat` entry, which might be used as a surrogate for `\(\mathrm{trace}(\mathbf{S})\)` for GCV, say.

Loess is a robust smoother (linear smoothers are not) and relatively insensitive to outliers.

---

## A loess smoother

```r
p + geom_smooth()
```

```
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
```

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-7-1.png" height="400" style="display: block; margin: auto;" />

---

## Another loess smoother

```r
p + geom_smooth(method = "loess", span = 0.5)
```

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-8-1.png" height="400" style="display: block; margin: auto;" />

---

## A linear smoother

```r
p + geom_smooth(method = "lm")
```

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-9-1.png" height="400" style="display: block; margin: auto;" />

---

## A polynomial smoother

```r
p + geom_smooth(method = "lm", formula = y ~ poly(x, 5))
```

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-10-1.png" height="400" style="display: block; margin: auto;" />

---

## Another polynomial smoother

```r
p + geom_smooth(method = "lm", formula = y ~ poly(x, 20))
```

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-11-1.png" height="400" style="display: block; margin: auto;" />

---

## A spline smoother

```r
p + geom_smooth(method = "gam", formula = y ~ s(x))
```

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-12-1.png" height="400" style="display: block; margin: auto;" />

---

## Another spline smoother

```r
p + geom_smooth(method = "gam", formula = y ~ s(x, k = 40))
```

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-13-1.png" height="400" style="display: block; margin: auto;" />

---

## Smoothing with ggplot2

The `geom_smooth` function easily adds miscellaneous model fits or scatter plot smoothers to a scatter plot. The `stat_smooth` function is responsible for delegating the computations. See `?stat_smooth`.

Spline smoothing is performed via the `gam` function in the mgcv package, whereas loess smoothing is done via the `loess` function in the stats package.

Any "smoother" can be used that supports a formula interface and has a prediction function adhering to the standards of `predict.lm`.
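---

## A robust polynomial smoother

As an example of this flexibility, `MASS::rlm` supports the formula interface and has a `predict` method, so it can be plugged directly into `geom_smooth`. A minimal sketch, assuming the MASS package is installed; the polynomial degree is arbitrary:

```r
p + geom_smooth(method = MASS::rlm, formula = y ~ poly(x, 5), se = FALSE)
```

The next slides build a custom smoother along the same lines.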
---

## An interface for `geom_smooth`

```r
rMean <- function(..., data, k = 5) {
  ## Extra arguments (e.g. the formula passed by stat_smooth) go into '...'
  ord <- order(data$x)
  structure(
    list(x = data$x[ord], y = runMean(data$y[ord], k = k)), 
    class = "rMean"
  )
}

predict.rMean <- function(object, newdata, ...)
  approx(object$x, object$y, newdata$x)$y  ## Linear interpolation
```

---

## A running mean

```r
p + geom_smooth(method = "rMean", se = FALSE, n = 200)
```

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-15-1.png" height="400" style="display: block; margin: auto;" />

---

## Another running mean

```r
p + geom_smooth(method = "rMean", se = FALSE, n = 200, 
                method.args = list(k = 13))
```

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-16-1.png" height="400" style="display: block; margin: auto;" />

---

## Boundary

```r
rMean <- function(..., data, k = 5, boundary = NULL) {
  ord <- order(data$x)
  y <- data$y[ord]
  n <- length(y)
  m <- floor((k - 1) / 2)
  if (m > 0 && !is.null(boundary)) {
    ## Extend the series by m values at each end before smoothing
    if (boundary == "pad")  ## Repeat the end points
      y <- c(rep(y[1], m), y, rep(y[n], m))
    if (boundary == "rev")  ## Reflect the m values nearest each end
      y <- c(y[m:1], y, y[n:(n - m + 1)])
  }
  s <- runMean(y, k = k)
  if (!is.null(boundary)) 
    s <- na.omit(s)  ## Drop the NAs produced at the extended ends
  structure(list(x = data$x[ord], y = s), class = "rMean")
}
```

---

## No boundary

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-18-1.png" height="400" style="display: block; margin: auto;" />

---

## Boundary, padding

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-19-1.png" height="400" style="display: block; margin: auto;" />

---

## Boundary, reversion

<img src="BivariateSmoothers_files/figure-html/unnamed-chunk-20-1.png" height="400" style="display: block; margin: auto;" />
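---

## Reproducing the boundary plots

The three preceding figures can be produced with calls along the following lines. This is a sketch: the chunk code behind the figures is not shown here, so the bandwidth `k = 13` and the remaining settings are assumptions.

```r
p + geom_smooth(method = "rMean", se = FALSE, n = 200,
                method.args = list(k = 13))                    ## No boundary
p + geom_smooth(method = "rMean", se = FALSE, n = 200,
                method.args = list(k = 13, boundary = "pad"))  ## Padding
p + geom_smooth(method = "rMean", se = FALSE, n = 200,
                method.args = list(k = 13, boundary = "rev"))  ## Reversion
```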