R provides a formula interface via the lm
function to linear models. That is, instead of specifying the
\(\mathbf{X}\)-matrix explicitly, it is derived from a formula. This is very convenient,
and it allows us to focus on variables instead of how they are encoded as columns in
\(\mathbf{X}\). It is, however, useful to know precisely how R interprets the formulas
and translate them into matrices.
The \(\mathbf{X}\)-matrix is known as the model matrix in R terminology. Internally to
lm
the function model.matrix
is used to create the model matrix from the formula.
We usually don't need to call this function, but it is illustrative for understanding
exactly how formulas are parsed.
We need an example data set to work with.
b <- rnorm(150, 0)
c <- factor(sample(c(rep("A", 50), rep("B", 50), rep("C", 50))))
mu <- (c == "A") + cos(b) + 0.5 * (c == "B") * sin(b)
a <- rnorm(150, mu, 0.5)
and we plot the data. The dashed curves are the true means.
library(ggplot2)
library(gridExtra)
## Loading required package: grid
p1 <- qplot(b, a)
p2 <- qplot(b, a, color = c) +
geom_line(aes(y = mu), linetype = 2) +
theme(legend.position="top")
grid.arrange(p1, p2, ncol = 1)
Including just the variable on the right hand side of a formula results in a model matrix with two columns. A one-column representing the intercept and a column with the observed values of the variable.
model.matrix(a ~ b)[1:10, ]
## (Intercept) b
## 1 1 0.97806
## 2 1 -0.06564
## 3 1 -0.41319
## 4 1 -1.19396
## 5 1 1.11093
## 6 1 0.09952
## 7 1 1.34418
## 8 1 -0.72335
## 9 1 1.17599
## 10 1 -1.82327
Find out how you get rid of the intercept.
If we “add” a factor on \(k\) levels the model matrix gets \(k-1\)
additional columns. These are dummy variable encodings of the
levels of the factor. The precise encoding is controlled by
the contrasts.arg
argument to model.matrix
(the contrasts
argument to lm
) or by setting the contrasts associated with
a factor.
model.matrix(a ~ b + c)[1:10, ]
## (Intercept) b cB cC
## 1 1 0.97806 0 0
## 2 1 -0.06564 1 0
## 3 1 -0.41319 1 0
## 4 1 -1.19396 0 1
## 5 1 1.11093 0 1
## 6 1 0.09952 1 0
## 7 1 1.34418 1 0
## 8 1 -0.72335 0 1
## 9 1 1.17599 0 0
## 10 1 -1.82327 0 0
### We change the base level to be level 3 (value "C")
contrasts(c) <- contr.treatment(levels(c), base = 3)
model.matrix(a ~ b + c)[1:10, ]
## (Intercept) b cA cB
## 1 1 0.97806 1 0
## 2 1 -0.06564 0 1
## 3 1 -0.41319 0 1
## 4 1 -1.19396 0 0
## 5 1 1.11093 0 0
## 6 1 0.09952 0 1
## 7 1 1.34418 0 1
## 8 1 -0.72335 0 0
## 9 1 1.17599 1 0
## 10 1 -1.82327 1 0
In formulas, the arithmetic operators *
and ^
together with
:
have special meanings.
b * c
is equivalent b + c + b : c
b : c
formulates the interaction model.(a + b + c)^2
is equivalent to (a + b + c) * (a + b + c)
is equivalent to a + b + c + a : b + a : c + b : c
, etc. model.matrix(a ~ c + b : c)[1:10, ]
## (Intercept) cA cB cA:b cB:b cC:b
## 1 1 1 0 0.9781 0.00000 0.0000
## 2 1 0 1 0.0000 -0.06564 0.0000
## 3 1 0 1 0.0000 -0.41319 0.0000
## 4 1 0 0 0.0000 0.00000 -1.1940
## 5 1 0 0 0.0000 0.00000 1.1109
## 6 1 0 1 0.0000 0.09952 0.0000
## 7 1 0 1 0.0000 1.34418 0.0000
## 8 1 0 0 0.0000 0.00000 -0.7234
## 9 1 1 0 1.1760 0.00000 0.0000
## 10 1 1 0 -1.8233 0.00000 0.0000
model.matrix(a ~ b * c)[1:10, ]
## (Intercept) b cA cB b:cA b:cB
## 1 1 0.97806 1 0 0.9781 0.00000
## 2 1 -0.06564 0 1 0.0000 -0.06564
## 3 1 -0.41319 0 1 0.0000 -0.41319
## 4 1 -1.19396 0 0 0.0000 0.00000
## 5 1 1.11093 0 0 0.0000 0.00000
## 6 1 0.09952 0 1 0.0000 0.09952
## 7 1 1.34418 0 1 0.0000 1.34418
## 8 1 -0.72335 0 0 0.0000 0.00000
## 9 1 1.17599 1 0 1.1760 0.00000
## 10 1 -1.82327 1 0 -1.8233 0.00000
Transformations can be achieved by formally applying an R function to a variable in the formula specification. Expansions can be achieved if the R function returns a matrix and not a vector when applied to a vector.
library(splines)
model.matrix(a ~ ns(b, df = 4))[1:10, ]
## (Intercept) ns(b, df = 4)1 ns(b, df = 4)2 ns(b, df = 4)3 ns(b, df = 4)4
## 1 1 0.32211 0.52372 0.1935 -0.039356
## 2 1 0.73183 -0.03776 0.1819 -0.106691
## 3 1 0.52349 -0.13980 0.3406 -0.199725
## 4 1 0.13281 -0.19563 0.4730 -0.277369
## 5 1 0.24933 0.54597 0.2180 -0.013306
## 6 1 0.77739 0.03664 0.1247 -0.073106
## 7 1 0.14966 0.53770 0.2558 0.056891
## 8 1 0.32753 -0.18489 0.4470 -0.262136
## 9 1 0.21809 0.54945 0.2292 0.003281
## 10 1 0.01983 -0.13094 0.3166 -0.185648
Before you run the next expression in R, try to answer the following question on the basis of what you have learned.
How many columns will the model matrix have? Explain and interpret what they mean.
model.matrix(a ~ ns(b, df = 4) * c)[1:10, ]
Using the model matrix, or rather the formula, from above, we fit a linear model.
aLm <- lm(a ~ ns(b, df = 4) * c)
muHat <- predict(aLm)
qplot(b, a, color = c) +
geom_line(aes(y = mu), linetype = 2) +
geom_line(aes(y = muHat))
The predicted values from the model are estimated and plotted as curves. The dashed curves are still the true means.