Linear Model Fitting

We will be using the Palmer penguins data to understand body mass.

Fit a model using lm(Y ~ X1 + X2 +..., data = dataframe)

The function to fit a linear regression model in R is lm(Y ~ X1 + X2 +..., data = mydataframe). lm, as many other functions in R, uses the formula interface The tilde (~) can be translated as is a function of. We are saying that \(Y\) is a function of \(X1\), \(X2\), etc., and the data for our analysis comes from mydataframe.

Back to our penguins, we want to see whether body mass is a function of flipper length. We create an object called model1 that holds the results of this linear regression model.

model1 <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

Look at model output

We will be using the broom package to make modelling easier to work with. There are 3 main functions in broom:

  • tidy() - This is where you get most of the output you want, including coefficients and p-values
  • glance() - additional measures on your model, including R-squared, log likelihood, and AIC/BIC
  • augment() - make predictions with your model using new data

For now, we will use broom::tidy() and broom::glance() to get model results.

model1 %>% broom::tidy()
termestimatestd.errorstatisticp.value
(Intercept)-5.78e+03306   -18.95.59e-55 
flipper_length_mm49.7     1.5232.74.37e-107
model1 %>% broom::glance()
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.7590.7583941.07e+034.37e-1071-2.53e+035.06e+035.07e+035.29e+07340342

Write down the equation for model1

\[ \text{body_mass_g} = \alpha + \beta_{1}(\text{flipper_length_mm}) + \epsilon \] \[ \text{body_mass_g} = -5780.83 + 49.69(\text{flipper_length_mm}) + \epsilon \]

Plot scatterplot and residuals

Add more explanatory variables

model2 <- lm(body_mass_g ~ flipper_length_mm + species , data = penguins)

model3 <- lm(body_mass_g ~ flipper_length_mm + species + sex , data = penguins)

model4 <- lm(body_mass_g ~ flipper_length_mm + species + sex + bill_length_mm, data = penguins)

model5 <- lm(body_mass_g ~ flipper_length_mm + species + sex + bill_length_mm + bill_depth_mm , data = penguins)

# Fit a model with all explanatory variables (~ .)
model6 <- lm(body_mass_g ~ . , data = penguins)

Check collinearity

With so many explanatory variables, we need to worry about colinearity, i.e., whether the explanatory variables (all of the \(X\)’s) are highly correlated among themselves.

#model2 <- lm(body_mass_g ~ flipper_length_mm + species , data = penguins)
car::vif(model2)
##                   GVIF Df GVIF^(1/(2*Df))
## flipper_length_mm 4.51  1            2.12
## species           4.51  2            1.46
# model3 <- lm(body_mass_g ~ flipper_length_mm + species + sex , data = penguins)
car::vif(model3)
##                   GVIF Df GVIF^(1/(2*Df))
## flipper_length_mm 6.05  1            2.46
## species           5.65  2            1.54
## sex               1.36  1            1.17
# model4 <- lm(body_mass_g ~ flipper_length_mm + species + sex + bill_length_mm, data = penguins)
car::vif(model4)
##                    GVIF Df GVIF^(1/(2*Df))
## flipper_length_mm  6.44  1            2.54
## species           18.16  2            2.06
## sex                1.81  1            1.35
## bill_length_mm     5.95  1            2.44
# model5 <- lm(body_mass_g ~ flipper_length_mm + species + sex + bill_length_mm + bill_depth_mm , data = penguins)
car::vif(model5)
##                    GVIF Df GVIF^(1/(2*Df))
## flipper_length_mm  6.69  1            2.59
## species           41.07  2            2.53
## sex                2.31  1            1.52
## bill_length_mm     6.07  1            2.46
## bill_depth_mm      6.08  1            2.47
# model6 <- lm(body_mass_g ~ . , data = penguins)
car::vif(model6)
##                    GVIF Df GVIF^(1/(2*Df))
## species           71.20  2            2.90
## island             3.76  2            1.39
## bill_length_mm     6.12  1            2.47
## bill_depth_mm      6.27  1            2.50
## flipper_length_mm  7.78  1            2.79
## sex                2.34  1            1.53
## year               1.17  1            1.08
model7 <- lm(body_mass_g ~ flipper_length_mm +  sex + bill_depth_mm , data = penguins)
car::vif(model7)
## flipper_length_mm               sex     bill_depth_mm 
##              2.44              1.89              2.65

Summary model comparison table using huxtable::huxreg()

Which of the six models we have fit seems to be the best one? Let us compare them on one table.

huxreg(model1, model2, model3, model4, model5, model6, model7,
                 statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma'), 
                 bold_signif = 0.05
) %>% 
  set_caption('Comparison of models')
Comparison of models
(1)(2)(3)(4)(5)(6)(7)
(Intercept)-5780.831 ***-4031.477 ***-365.817    -759.064    -1460.995 *  84087.945 *  -2246.829 ***
(305.815)   (584.151)   (532.050)   (541.377)   (571.308)   (41912.019)   (625.286)   
flipper_length_mm49.686 ***40.705 ***20.025 ***17.847 ***15.950 ***18.504 ***38.190 ***
(1.518)   (3.071)   (2.846)   (2.902)   (2.910)   (3.128)   (2.084)   
speciesChinstrap        -206.510 ***-87.634    -291.711 ***-251.477 ** -282.539 **         
        (57.731)   (46.347)   (81.502)   (81.079)   (88.790)           
speciesGentoo        266.810 ** 836.260 ***707.028 ***1014.627 ***890.958 ***        
        (95.264)   (85.185)   (94.359)   (129.561)   (144.563)           
sexmale                530.381 ***465.395 ***389.892 ***378.977 ***538.080 ***
                (37.810)   (43.081)   (47.848)   (48.074)   (51.310)   
bill_length_mm                        21.633 ** 18.204 *  18.964 **         
                        (7.148)   (7.106)   (7.112)           
bill_depth_mm                                67.218 ***60.798 ** -86.947 ***
                                (19.742)   (20.002)   (15.456)   
islandDream                                        -21.180            
                                        (58.390)           
islandTorgersen                                        -58.777            
                                        (60.852)           
year                                        -42.785 *          
                                        (20.949)           
#observations342        342        333        333        333        333        333        
R squared0.759    0.783    0.867    0.871    0.875    0.877    0.823    
Adj. R Squared0.758    0.781    0.865    0.869    0.873    0.873    0.821    
Residual SE394.278    375.535    295.565    291.955    287.338    286.524    340.427    
*** p < 0.001; ** p < 0.01; * p < 0.05.

The best model seems to be model 7, so we will use broom::tidy() and broom::glance() to get model results.

model7 %>% broom::tidy()
termestimatestd.errorstatisticp.value
(Intercept)-2.25e+03625   -3.590.000376
flipper_length_mm38.2     2.0818.3 3.47e-52
sexmale538       51.3 10.5 2.17e-22
bill_depth_mm-86.9     15.5 -5.633.96e-08
model7 %>% broom::glance()
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.8230.8213405092.9e-1233-2.41e+034.83e+034.85e+033.81e+07329333

Let us write down its equation

\[ \begin{aligned} \text{body_mass_g} &= \alpha + \beta_{1}(\text{flipper_length_mm}) + \beta_{2}(\text{sex}_{\text{male}})\ + \beta_{3}(\text{bill_depth_mm}) + \epsilon \end{aligned} \]\[ \begin{aligned} \text{body_mass_g} &= -2246.83 + 38.19(\text{flipper_length_mm}) + 538.08(\text{sex}_{\text{male}})\ - 86.95(\text{bill_depth_mm}) + \epsilon \end{aligned} \]

Fitting multiple regression models in one go

Let us recall the relationship between body mass and bill depth and have a look at the scatterplot.

We could run three separate regression, but we can estimate three regression models with a few lines of code.

penguins %>%
  na.omit() %>% 
  group_by(species) %>%
  summarise(
    broom::tidy(lm( body_mass_g ~ bill_depth_mm))
  )
## # A tibble: 6 x 6
## # Groups:   species [3]
##   species   term          estimate std.error statistic  p.value
##   <fct>     <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 Adelie    (Intercept)     -297.      469.    -0.634  5.27e- 1
## 2 Adelie    bill_depth_mm    218.       25.5    8.55   1.67e-14
## 3 Chinstrap (Intercept)      -36.2     613.    -0.0591 9.53e- 1
## 4 Chinstrap bill_depth_mm    205.       33.2    6.16   4.79e- 8
## 5 Gentoo    (Intercept)     -422.      488.    -0.864  3.89e- 1
## 6 Gentoo    bill_depth_mm    368.       32.5   11.3    1.64e-20

What we see is BLAH…

What if we add sex? First, let us facet_wrap() our scatter plot to see what it looks like

penguins %>%
  na.omit() %>% 
  group_by(species) %>%
  summarise(
    broom::tidy(lm( body_mass_g ~ bill_depth_mm + sex))
  )
## # A tibble: 9 x 6
## # Groups:   species [3]
##   species   term          estimate std.error statistic  p.value
##   <fct>     <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 Adelie    (Intercept)     1931.      452.      4.28  3.47e- 5
## 2 Adelie    bill_depth_mm     81.6      25.6     3.19  1.74e- 3
## 3 Adelie    sexmale          556.       62.1     8.96  1.63e-15
## 4 Chinstrap (Intercept)      830.      861.      0.964 3.39e- 1
## 5 Chinstrap bill_depth_mm    153.       48.9     3.14  2.55e- 3
## 6 Chinstrap sexmale          156.      110.      1.42  1.60e- 1
## 7 Gentoo    (Intercept)     2741.      579.      4.73  6.37e- 6
## 8 Gentoo    bill_depth_mm    136.       40.6     3.35  1.08e- 3
## 9 Gentoo    sexmale          604.       79.8     7.57  9.94e-12

Simpson’s paradox

Recall from our EDA, we saw no relationship between bill length and depth.

bill_no_species

If we fit a simple regression model, we get the following

simpsons_model <- lm( bill_depth_mm ~ bill_length_mm, 
                      data = penguins %>% na.omit())

simpsons_model %>% broom::tidy()
termestimatestd.errorstatisticp.value
(Intercept)20.8   0.854 24.3 1.03e-75
bill_length_mm-0.08230.0193-4.272.53e-05
simpsons_model %>% broom::glance()

r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.05230.04941.9218.32.53e-051-6891.38e+031.39e+031.22e+03331333
The slope is significant, but the model r.squared (R2) explains only 5% of the overall variability.

However, when we plotted the same scatterplot colouring points by species, we got a completely different story.

bill_len_dep

We can again fit three individual models in one go

penguins %>%
  na.omit() %>% 
  group_by(species) %>%
  summarise(
    broom::tidy(lm( bill_depth_mm ~ bill_length_mm )),
    broom::glance(lm( bill_depth_mm ~ bill_length_mm ))
  )
## # A tibble: 6 x 16
## # Groups:   species [3]
##   species term  estimate std.error statistic  p.value r.squared adj.r.squared
##   <fct>   <chr>    <dbl>     <dbl>     <dbl>    <dbl>     <dbl>         <dbl>
## 1 Adelie  (Int~   11.5      1.37        25.2 1.51e- 6     0.149         0.143
## 2 Adelie  bill~    0.177    0.0352      25.2 1.51e- 6     0.149         0.143
## 3 Chinst~ (Int~    7.57     1.55        49.2 1.53e- 9     0.427         0.418
## 4 Chinst~ bill~    0.222    0.0317      49.2 1.53e- 9     0.427         0.418
## 5 Gentoo  (Int~    5.12     1.06        87.5 7.34e-16     0.428         0.423
## 6 Gentoo  bill~    0.208    0.0222      87.5 7.34e-16     0.428         0.423
## # ... with 8 more variables: sigma <dbl>, df <dbl>, logLik <dbl>, AIC <dbl>,
## #   BIC <dbl>, deviance <dbl>, df.residual <int>, nobs <int>

Alternatively, we can run a multiple regression model and the adjusted R2 = 76.5%

simpsons_model2 <- lm( bill_depth_mm ~ bill_length_mm + species, 
                      data = penguins %>% na.omit())

simpsons_model2 %>% broom::tidy()
termestimatestd.errorstatisticp.value
(Intercept)10.6 0.691 15.3 2.98e-40
bill_length_mm0.2 0.017711.3 2.26e-25
speciesChinstrap-1.930.226 -8.564.26e-16
speciesGentoo-5.1 0.194 -26.3 1.04e-82
simpsons_model2 %>% broom::glance()
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.7670.7650.9543628.88e-1043-455920939300329333