Formulas in R

R provides a formula interface via the lm function to linear models. That is, instead of specifying the \(\mathbf{X}\)-matrix explicitly, it is derived from a formula. This is very convenient, and it allows us to focus on variables instead of how they are encoded as columns in \(\mathbf{X}\). It is, however, useful to know precisely how R interprets the formulas and translate them into matrices.

The \(\mathbf{X}\)-matrix is known as the model matrix in R terminology. Internally to lm the function model.matrix is used to create the model matrix from the formula. We usually don't need to call this function, but it is illustrative for understanding exactly how formulas are parsed.

Sample data

We need an example data set to work with.

b <- rnorm(150, 0)
c <- factor(sample(c(rep("A", 50), rep("B", 50), rep("C", 50))))
mu <- (c == "A") + cos(b) + 0.5 * (c == "B") * sin(b)
a <- rnorm(150, mu, 0.5)

and we plot the data. The dashed curves are the true means.

library(ggplot2)
library(gridExtra)

## Loading required package: grid

p1 <- qplot(b, a)
p2 <- qplot(b, a, color = c) + 
  geom_line(aes(y = mu), linetype = 2) +
  theme(legend.position="top")
grid.arrange(p1, p2, ncol = 1)

plot of chunk exampleFig

Simple model matrices

Including just the variable on the right hand side of a formula results in a model matrix with two columns. A one-column representing the intercept and a column with the observed values of the variable.

model.matrix(a ~ b)[1:10, ]

##    (Intercept)        b
## 1            1  0.97806
## 2            1 -0.06564
## 3            1 -0.41319
## 4            1 -1.19396
## 5            1  1.11093
## 6            1  0.09952
## 7            1  1.34418
## 8            1 -0.72335
## 9            1  1.17599
## 10           1 -1.82327

Find out how you get rid of the intercept.

If we “add” a factor on \(k\) levels the model matrix gets \(k-1\) additional columns. These are dummy variable encodings of the levels of the factor. The precise encoding is controlled by the contrasts.arg argument to model.matrix (the contrasts argument to lm) or by setting the contrasts associated with a factor.

model.matrix(a ~ b + c)[1:10, ]

##    (Intercept)        b cB cC
## 1            1  0.97806  0  0
## 2            1 -0.06564  1  0
## 3            1 -0.41319  1  0
## 4            1 -1.19396  0  1
## 5            1  1.11093  0  1
## 6            1  0.09952  1  0
## 7            1  1.34418  1  0
## 8            1 -0.72335  0  1
## 9            1  1.17599  0  0
## 10           1 -1.82327  0  0

### We change the base level to be level 3 (value "C")
contrasts(c) <- contr.treatment(levels(c), base = 3)
model.matrix(a ~ b + c)[1:10, ]

##    (Intercept)        b cA cB
## 1            1  0.97806  1  0
## 2            1 -0.06564  0  1
## 3            1 -0.41319  0  1
## 4            1 -1.19396  0  0
## 5            1  1.11093  0  0
## 6            1  0.09952  0  1
## 7            1  1.34418  0  1
## 8            1 -0.72335  0  0
## 9            1  1.17599  1  0
## 10           1 -1.82327  1  0

Interactions

In formulas, the arithmetic operators * and ^ together with : have special meanings.

Usage of b * c is equivalent b + c + b : c
Usage of b : c formulates the interaction model.
Usage of (a + b + c)^2 is equivalent to (a + b + c) * (a + b + c) is equivalent to a + b + c + a : b + a : c + b : c, etc.

model.matrix(a ~ c + b : c)[1:10, ]

##    (Intercept) cA cB    cA:b     cB:b    cC:b
## 1            1  1  0  0.9781  0.00000  0.0000
## 2            1  0  1  0.0000 -0.06564  0.0000
## 3            1  0  1  0.0000 -0.41319  0.0000
## 4            1  0  0  0.0000  0.00000 -1.1940
## 5            1  0  0  0.0000  0.00000  1.1109
## 6            1  0  1  0.0000  0.09952  0.0000
## 7            1  0  1  0.0000  1.34418  0.0000
## 8            1  0  0  0.0000  0.00000 -0.7234
## 9            1  1  0  1.1760  0.00000  0.0000
## 10           1  1  0 -1.8233  0.00000  0.0000

model.matrix(a ~ b * c)[1:10, ]

##    (Intercept)        b cA cB    b:cA     b:cB
## 1            1  0.97806  1  0  0.9781  0.00000
## 2            1 -0.06564  0  1  0.0000 -0.06564
## 3            1 -0.41319  0  1  0.0000 -0.41319
## 4            1 -1.19396  0  0  0.0000  0.00000
## 5            1  1.11093  0  0  0.0000  0.00000
## 6            1  0.09952  0  1  0.0000  0.09952
## 7            1  1.34418  0  1  0.0000  1.34418
## 8            1 -0.72335  0  0  0.0000  0.00000
## 9            1  1.17599  1  0  1.1760  0.00000
## 10           1 -1.82327  1  0 -1.8233  0.00000

Basis expansions

Transformations can be achieved by formally applying an R function to a variable in the formula specification. Expansions can be achieved if the R function returns a matrix and not a vector when applied to a vector.

library(splines)
model.matrix(a ~ ns(b, df = 4))[1:10, ]

##    (Intercept) ns(b, df = 4)1 ns(b, df = 4)2 ns(b, df = 4)3 ns(b, df = 4)4
## 1            1        0.32211        0.52372         0.1935      -0.039356
## 2            1        0.73183       -0.03776         0.1819      -0.106691
## 3            1        0.52349       -0.13980         0.3406      -0.199725
## 4            1        0.13281       -0.19563         0.4730      -0.277369
## 5            1        0.24933        0.54597         0.2180      -0.013306
## 6            1        0.77739        0.03664         0.1247      -0.073106
## 7            1        0.14966        0.53770         0.2558       0.056891
## 8            1        0.32753       -0.18489         0.4470      -0.262136
## 9            1        0.21809        0.54945         0.2292       0.003281
## 10           1        0.01983       -0.13094         0.3166      -0.185648

Before you run the next expression in R, try to answer the following question on the basis of what you have learned.

How many columns will the model matrix have? Explain and interpret what they mean.

model.matrix(a ~ ns(b, df = 4) * c)[1:10, ]

A linear model

Using the model matrix, or rather the formula, from above, we fit a linear model.

aLm <- lm(a ~ ns(b, df = 4) * c)
muHat <- predict(aLm)
qplot(b, a, color = c) + 
  geom_line(aes(y = mu), linetype = 2) + 
  geom_line(aes(y = muHat))

plot of chunk linearModel

The predicted values from the model are estimated and plotted as curves. The dashed curves are still the true means.