── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(emmeans)
Welcome to emmeans.
Caution: You lose important information if you filter this package's results.
See '? untidy'
Import data
We load the birthwt dataset from the MASS package and make the relevant wrangling of the dataset. In the end, notice how levels 3-6 of ftvFac are collapsed into one level denoted “MoreThanOnce”.
# Load "birthwt" dataset from the MASS-packagedata(birthwt,package ="MASS")
For nice printing etc we reformulate birthwt as a tibble.
### Make a tibble with the databirthData <-as_tibble(birthwt)birthData
We recode the smoke variable from a numerical variable to a categorical one.
### Make smoke into a factorbirthData <-mutate(birthData, smoke =factor(smoke))
We examine the ftv variable, then construct a categorical version of the ftv variable, then make a version where levels 2 and above are collapsed into just one level.
# Check the ftv variable, make it a factor and collapse some of the levelstable(birthData$ftv)
We make boxplots for bwt for different levels of one or more variables. Birth weight appears to be smaller for infants born by smokers compared to non-smokers, while birth weight does not appear to depend on the frequency of vistis to the doctor. This is just preliminary consideratiions; formal analyses are carried out later.
Scatterplot with lwt on the \(x\)-axis and bwt on the \(y\)-axis. The color and shape of the points can be controlled by the options color and shape. There is no clear association between the two variables (mother’s weight before pregnancy and infant weight), but let us see in the analysis.
### Points coloured after visitsggplot(birthData, aes(x=lwt, y=bwt, color=visits)) +geom_point()
### Points coloured after visits, point types after smoke statusggplot(birthData, aes(x=lwt, y=bwt, color=visits, shape=smoke)) +geom_point()
Regression
We use the lm function to fit the simple linear regression model with infant’s birth weight as response and mothers weight before pregnancy as predictor. Then, we extract estimates and confidence intervals (CIs). Specifically we get a 95% CI 1.05-7.81 for the slope parameter. Notice that zero is not included in the CI, in line with the p-value of 0.01 which appears in the summary. So, despite the impression from the scatter plot above, there is a significant association between the two variables.
### Simple linear regressionreg1 <-lm(bwt ~ lwt, data=birthData) summary(reg1)
Call:
lm(formula = bwt ~ lwt, data = birthData)
Residuals:
Min 1Q Median 3Q Max
-2192.12 -497.97 -3.84 508.32 2075.60
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2369.624 228.493 10.371 <2e-16 ***
lwt 4.429 1.713 2.585 0.0105 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 718.4 on 187 degrees of freedom
Multiple R-squared: 0.0345, Adjusted R-squared: 0.02933
F-statistic: 6.681 on 1 and 187 DF, p-value: 0.0105
We make the usual model validation plots. None of them gives us any reasons to worry about the validity of the model assumtions.
### Model validationpar(mfrow=c(2,2))plot(reg1)
We include extra predictors in the model, so we now get a multiple linear regression. The regression coefficients are estimated to 4.12 for mother’s weight (lwt), 7.49 for age and 15.4 for number of visits to the doctor (ftv). The interpretation of the last one, say, is that the birth weight is expected to increase by 15.4 grams per extra visit by the doctor, all other predictors kept unchanged. Notice that the estimate is not significantly different from zero (p=0.76) though.
### Include age and ftv as covariatereg2 <-lm(bwt ~ lwt + age + ftv, data=birthData)summary(reg2)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2223.505618 301.567140 7.3731694 5.402210e-12
lwt 4.120744 1.757787 2.3442798 2.012498e-02
age 7.485748 10.285178 0.7278189 4.676446e-01
ftv 15.363944 51.113921 0.3005824 7.640705e-01
We make a new (boring dataset) with just one data line and values of each of the three predictors from the multiple linear regression from the previous question. The predict function computes the corresponding predicted value for the birth weight. It needs to know the predictor values as well as which model to use for the prediction (you could try using reg1 instead of reg2; then you would get a slightly different prediction). The interval option makes it possible to supplement the point prediction with a 95% prediction interval, i.e., an interval that contains the birth weight for 95% of infants with the given predictor characteristics.
We use a different dataset for the regression, namely a subset constructed with the filter function. There is just for the sake of the example, so we refrain from looking at the results in any detail.
### Use only data from mothers with weight below 160reg3 <-lm(bwt ~ lwt + age + ftv, data=filter(birthData, lwt<160))summary(reg3)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1824.063262 444.508301 4.1035528 0.0000651941
lwt 7.182184 3.341787 2.1492047 0.0331503757
age 9.178577 11.624301 0.7896024 0.4309508695
ftv 16.986307 60.011783 0.2830495 0.7775117320
ANOVA
We use lm to fit the oneway ANOVA. The non-smokers (smoke=0) are used as reference, so the expected birth weight for that group is given by the intercept, 3056 grams. For the estimate for smokers, we must add the contrast/difference caused by the smoke variable (which is negative). We get 3056-284 = 2772. With emmeans we get both numbers automatically (but then not the difference). The p-value for the effect of smoking is 0.0087, which can be found both from the summary and the pairs output, so there is a statistically significant effect of smoking.
### Oneway ANOVA against smokeoneway1 <-lm(bwt ~ smoke, data=birthData)summary(oneway1)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3055.6957 66.93326 45.652875 2.463035e-103
smoke1 -283.7767 106.96877 -2.652893 8.666726e-03
We now fit a another oneway ANOVA where visits from question 4 is used as the explanatory variable. The groupwise estimated birth weights are most easily found with emmeans and are 2865 g, 3108 g, and 2951 g, respectively. The output from pairs shows that none of the groups differ statistically significantly from each other (p-valuer are 0.14 or larger). Finally, the overall test gives a p-value of 0.17 (drop1), confirming that groups do not differ significantly.
### Oneway ANOVA against visitsoneway2 <-lm(bwt ~ visits, data=birthData)emmeans(oneway2,~visits)
visits emmean SE df lower.CL upper.CL
Never 2865 72.6 186 2722 3008
Once 3108 106.0 186 2899 3317
MoreThanOnce 2951 112.0 186 2730 3172
Confidence level used: 0.95
pairs(emmeans(oneway2,~visits))
contrast estimate SE df t.ratio p.value
Never - Once -242.9 128 186 -1.891 0.1441
Never - MoreThanOnce -85.7 134 186 -0.642 0.7970
Once - MoreThanOnce 157.1 154 186 1.019 0.5658
P value adjustment: tukey method for comparing a family of 3 estimates
drop1(oneway2,test="F")
Single term deletions
Model:
bwt ~ visits
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 98081730 2493.2
visits 2 1887925 99969656 2492.8 1.7901 0.1698
A twoway ANOVA with both smoke and visits is fitted. It is the model without interaction because there is a plus (+) between the two variables. Reference groups are chosen for each variable (smoke=0 corresponding to non-smokers and vists=Never), and the intercept 2981 g is the estimated birth weight for the reference combination. For the contrast, the birth weight is estimated to decrease by 257 g if the mother is a smoker, and to increase by 193 g and 74 g if the mother visits the doctor once or more than once, respectively. We also get p-values for each of these contrasts, and conclusions are unchanged from the oneway ANOVAs: There is a significant effect of smoking but not a significant effect of visiting the doctor.
The twoway ANOVA with interaction is fitted by using a multiplication sign or star (*) between the two variables. This allows for the effect os smoking to differ between visits groups, or vice versa. The hypothesis of no interaction can be carried out with the anova function, where the models to be compared (the models with and without interaction) are specified explicitly, or with drop1. The same test is carried out, so we get the same p-value, 0.14. So, the interaction effect is not significant. By the way, the function name anova is a bit annoying since it can be used for many model types, not only ANOVA models.
### Twoway ANOVA with interaction, test for interaction in two waystwoway2 <-lm(bwt ~ visits * smoke, data=birthData)anova(twoway2, twoway1)
Analysis of Variance Table
Model 1: bwt ~ visits * smoke
Model 2: bwt ~ visits + smoke
Res.Df RSS Df Sum of Sq F Pr(>F)
1 183 93189162
2 185 95182096 -2 -1992934 1.9568 0.1443
drop1(twoway2,test="F")
Single term deletions
Model:
bwt ~ visits * smoke
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 93189162 2489.5
visits:smoke 2 1992934 95182096 2489.5 1.9568 0.1443
The emmeans function can be used to compute group-specific estimates, on average over the other variables in the model. We get the estimated birth weights of 3066 g and 2772 g, respectively. This is not exactly the same as in question 12: The estimates in question 12 were simple averages in the two groups, while the new estimates are averages of estimates. This is not the same when the dataset is unbalanced wrt. to the two variables.
emmeans(twoway2, ~smoke)
NOTE: Results may be misleading due to involvement in interactions
smoke emmean SE df lower.CL upper.CL
0 3066 70.1 183 2928 3205
1 2766 96.4 183 2576 2956
Results are averaged over the levels of: visits
Confidence level used: 0.95
Models with numerical as well as categorical predictors
model1 includes both a numerical and a categorical explanatory variable. Birth weight is estimated to increase 4.24 g for each extra pound of mother’s weight (no matter the smoking status) and to differ 272 g for smoking and non-smoking mothers. The intercept is the estimated birth weight for a non-smoking mother with a weight of zero pounds! This is of course nonsense - as is often the case with regression type models when zero values of a numerical predictor cannot occur in practice.
### Model with linear (lwt,bwt) association. ### Intercept differ between smokers and non-smokers, one common slope.model1 <-lm(bwt ~ lwt + smoke, data=birthData)summary(model1)
Call:
lm(formula = bwt ~ lwt + smoke, data = birthData)
Residuals:
Min 1Q Median 3Q Max
-2030.90 -445.69 29.16 521.76 1967.76
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2501.125 230.836 10.835 <2e-16 ***
lwt 4.237 1.690 2.507 0.0130 *
smoke1 -272.081 105.591 -2.577 0.0107 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 707.8 on 186 degrees of freedom
Multiple R-squared: 0.06777, Adjusted R-squared: 0.05775
F-statistic: 6.761 on 2 and 186 DF, p-value: 0.001464
Interaction between a numerical and categorical variable is interpreted as different slopes (wrt. the numerical variable) for the different groups. Here, this means two straight lines (x=lwt and y=bwt), one for smokers and one for non-smokers, with different intercepts and different slopes. The model from question 17 had different intercepts, but the same slope. Both the intercept and the slope are reported in terms of a reference groups (non-smokers) and a difference between smokers and non-smokers, and the two estimated regression lines are therefore given by: bwt = 2351 + 5.39 * lwt for non-smokers and bwt = (2351+41) + (5.39-2.42) * lwt = 2392 - 2.97 * lwt for smokers. The interaction term, as reported the last line in the coefficients table, estimates the difference in the weight effect between non-smokers and smokers, so this is where we should look to answer the last question. The p-value is 0.48, so there is no evidence that the effect of mother’s weight affects the birth weight is different for smokers and non-smokers. The same test can be carried out by comparing to the additive model from question 17, using the anova function.
### Model with linear (lwt,bwt) association. Intercept and slope both differ between visit groupsmodel2 <-lm(bwt ~ lwt * smoke, data=birthData)summary(model2)
Call:
lm(formula = bwt ~ lwt * smoke, data = birthData)
Residuals:
Min 1Q Median 3Q Max
-2038.80 -454.76 28.36 530.84 1976.84
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2350.578 312.733 7.516 2.35e-12 ***
lwt 5.387 2.335 2.307 0.0222 *
smoke1 41.384 451.187 0.092 0.9270
lwt:smoke1 -2.422 3.388 -0.715 0.4757
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 708.8 on 185 degrees of freedom
Multiple R-squared: 0.07034, Adjusted R-squared: 0.05527
F-statistic: 4.666 on 3 and 185 DF, p-value: 0.003621
### Test if slopes differ between visit groupsanova(model2, model1)
Analysis of Variance Table
Model 1: bwt ~ lwt * smoke
Model 2: bwt ~ lwt + smoke
Res.Df RSS Df Sum of Sq F Pr(>F)
1 185 92937722
2 186 93194298 -1 -256576 0.5107 0.4757
We now include more variables in the model. The model validation plots cause no worries. Smoking is estimated to decrease infant birth weight by 244 g (SE 107 g), and the effect is statistically significant (p=0.017).
### Model with many effects (no interactions)model3 <-lm(bwt ~ lwt + smoke + age + visits, data=birthData)par(mfrow=c(2,2))plot(model3)
summary(model3)
Call:
lm(formula = bwt ~ lwt + smoke + age + visits, data = birthData)
Residuals:
Min 1Q Median 3Q Max
-2024.13 -491.65 6.56 507.56 1753.25
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2346.442 301.420 7.785 5e-13 ***
lwt 4.167 1.727 2.413 0.0168 *
smoke1 -244.402 107.187 -2.280 0.0238 *
age 4.300 10.252 0.419 0.6754
visitsOnce 184.314 130.194 1.416 0.1586
visitsMoreThanOnce 32.318 133.410 0.242 0.8089
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 708.6 on 183 degrees of freedom
Multiple R-squared: 0.08076, Adjusted R-squared: 0.05564
F-statistic: 3.215 on 5 and 183 DF, p-value: 0.008306
Logistic regression
We now fit a logistic regression, modelling the probability of a low birthweight defines as <2500 g. The effect of smoking is just not significant (p=0.06), which most likely reflects that we loose information going from the actual birth weights to the binary version. Notice that the estimated coefficient for smoking is now positive (0.62). This means that smoking increases the probability of getting a child with low birth weight; hence the direction of a potential effect is still the same: Smoking increases the likelihood of a smaller birth weight.
### Logistic regression with many predictors (no interactions)logreg1 <-glm(low ~ lwt + smoke + age + visits, data=birthData, family="binomial")summary(logreg1)
Call:
glm(formula = low ~ lwt + smoke + age + visits, family = "binomial",
data = birthData)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.370311 1.016623 1.348 0.1777
lwt -0.012276 0.006138 -2.000 0.0455 *
smoke1 0.619077 0.330428 1.874 0.0610 .
age -0.031764 0.033933 -0.936 0.3492
visitsOnce -0.413043 0.424527 -0.973 0.3306
visitsMoreThanOnce -0.148285 0.420965 -0.352 0.7247
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 221.91 on 183 degrees of freedom
AIC: 233.91
Number of Fisher Scoring iterations: 4
Linear mixed models
We first make the artificial center variable as described in the exercise. You should of course never make such a construct unless it reflects the data collection.
### Make artificial center variableset.seed(123)center <-sample(rep(1:19, each=10)[1:189])birthData <-mutate(birthData,center=factor(center))
We use the new variable as a random factor in a linear mixed model (LMM) and identify the relevant estimate for smoking to be 244 g (SE 107). This is exactly the same as in the model without random effects (question 19). That would typically not be the case, but happens here because the center-to-center standard deviation is essentially estimated to zero (3.109e-06), which is not too surprising since the center variable had nothing to do with actual data collection. Notice that the summary of an lmer object does not provide p-values. There are ways to get those p-values, for example by using emmeans or the lmerTest package instead of the the lme4 package, but we will not go into details about that here.
### Remember to install lme4 before this can run# install.packages("lme4")library(lme4)
Indlæser krævet pakke: Matrix
Vedhæfter pakke: 'Matrix'
De følgende objekter er maskerede fra 'package:tidyr':
expand, pack, unpack
### Linear mixed model with random effect of centerlmm1 <-lmer(bwt ~ lwt + smoke + age + visits + (1|center), data=birthData)
boundary (singular) fit: see help('isSingular')
summary(lmm1)
Linear mixed model fit by REML ['lmerMod']
Formula: bwt ~ lwt + smoke + age + visits + (1 | center)
Data: birthData
REML criterion at convergence: 2958
Scaled residuals:
Min 1Q Median 3Q Max
-2.85637 -0.69380 0.00926 0.71625 2.47412
Random effects:
Groups Name Variance Std.Dev.
center (Intercept) 9.668e-12 3.109e-06
Residual 5.022e+05 7.086e+02
Number of obs: 189, groups: center, 19
Fixed effects:
Estimate Std. Error t value
(Intercept) 2346.442 301.420 7.785
lwt 4.167 1.727 2.413
smoke1 -244.402 107.187 -2.280
age 4.300 10.252 0.419
visitsOnce 184.314 130.194 1.416
visitsMoreThanOnce 32.318 133.410 0.242
Correlation of Fixed Effects:
(Intr) lwt smoke1 age vstsOn
lwt -0.613
smoke1 -0.196 0.046
age -0.620 -0.173 0.003
visitsOnce -0.026 0.048 0.160 -0.218
vstsMrThnOn 0.055 -0.058 0.031 -0.190 0.335
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')