Linear Model Fitting

Fit a model using `lm(Y ~ X1 + X2 +..., data = dataframe)`

The function to fit a linear regression model in R is lm(Y ~ X1 + X2 +..., data = mydataframe). lm, as many other functions in R, uses the formula interface The tilde (~) can be translated as is a function of. We are saying that \(Y\) is a function of \(X1\), \(X2\), etc., and the data for our analysis comes from mydataframe.

Back to our penguins, we want to see whether body mass is a function of flipper length. We create an object called model1 that holds the results of this linear regression model.

model1 <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

Look at model output

We will be using the broom package to make modelling easier to work with. There are 3 main functions in broom:

tidy() - This is where you get most of the output you want, including coefficients and p-values
glance() - additional measures on your model, including R-squared, log likelihood, and AIC/BIC
augment() - make predictions with your model using new data

For now, we will use broom::tidy() and broom::glance() to get model results.

model1 %>% broom::tidy()

term	estimate	std.error	statistic	p.value
(Intercept)	-5.78e+03	306	-18.9	5.59e-55
flipper_length_mm	49.7	1.52	32.7	4.37e-107

model1 %>% broom::glance()

r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.759	0.758	394	1.07e+03	4.37e-107	1	-2.53e+03	5.06e+03	5.07e+03	5.29e+07	340	342

Write down the equation for model1

\[ \text{body_mass_g} = \alpha + \beta_{1}(\text{flipper_length_mm}) + \epsilon \] \[ \text{body_mass_g} = -5780.83 + 49.69(\text{flipper_length_mm}) + \epsilon \]

Plot scatterplot and residuals

Add more explanatory variables

model2 <- lm(body_mass_g ~ flipper_length_mm + species , data = penguins)

model3 <- lm(body_mass_g ~ flipper_length_mm + species + sex , data = penguins)

model4 <- lm(body_mass_g ~ flipper_length_mm + species + sex + bill_length_mm, data = penguins)

model5 <- lm(body_mass_g ~ flipper_length_mm + species + sex + bill_length_mm + bill_depth_mm , data = penguins)

# Fit a model with all explanatory variables (~ .)
model6 <- lm(body_mass_g ~ . , data = penguins)

Check collinearity

With so many explanatory variables, we need to worry about colinearity, i.e., whether the explanatory variables (all of the \(X\)’s) are highly correlated among themselves.

#model2 <- lm(body_mass_g ~ flipper_length_mm + species , data = penguins)
car::vif(model2)

##                   GVIF Df GVIF^(1/(2*Df))
## flipper_length_mm 4.51  1            2.12
## species           4.51  2            1.46

# model3 <- lm(body_mass_g ~ flipper_length_mm + species + sex , data = penguins)
car::vif(model3)

##                   GVIF Df GVIF^(1/(2*Df))
## flipper_length_mm 6.05  1            2.46
## species           5.65  2            1.54
## sex               1.36  1            1.17

# model4 <- lm(body_mass_g ~ flipper_length_mm + species + sex + bill_length_mm, data = penguins)
car::vif(model4)

##                    GVIF Df GVIF^(1/(2*Df))
## flipper_length_mm  6.44  1            2.54
## species           18.16  2            2.06
## sex                1.81  1            1.35
## bill_length_mm     5.95  1            2.44

# model5 <- lm(body_mass_g ~ flipper_length_mm + species + sex + bill_length_mm + bill_depth_mm , data = penguins)
car::vif(model5)

##                    GVIF Df GVIF^(1/(2*Df))
## flipper_length_mm  6.69  1            2.59
## species           41.07  2            2.53
## sex                2.31  1            1.52
## bill_length_mm     6.07  1            2.46
## bill_depth_mm      6.08  1            2.47

# model6 <- lm(body_mass_g ~ . , data = penguins)
car::vif(model6)

##                    GVIF Df GVIF^(1/(2*Df))
## species           71.20  2            2.90
## island             3.76  2            1.39
## bill_length_mm     6.12  1            2.47
## bill_depth_mm      6.27  1            2.50
## flipper_length_mm  7.78  1            2.79
## sex                2.34  1            1.53
## year               1.17  1            1.08

model7 <- lm(body_mass_g ~ flipper_length_mm +  sex + bill_depth_mm , data = penguins)
car::vif(model7)

## flipper_length_mm               sex     bill_depth_mm 
##              2.44              1.89              2.65

Summary model comparison table using `huxtable::huxreg()`

Which of the six models we have fit seems to be the best one? Let us compare them on one table.

huxreg(model1, model2, model3, model4, model5, model6, model7,
                 statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma'), 
                 bold_signif = 0.05
) %>% 
  set_caption('Comparison of models')

Comparison of models
	(1)	(2)	(3)	(4)	(5)	(6)	(7)
(Intercept)	-5780.831 ***	-4031.477 ***	-365.817	-759.064	-1460.995 *	84087.945 *	-2246.829 ***
	(305.815)	(584.151)	(532.050)	(541.377)	(571.308)	(41912.019)	(625.286)
flipper_length_mm	49.686 ***	40.705 ***	20.025 ***	17.847 ***	15.950 ***	18.504 ***	38.190 ***
	(1.518)	(3.071)	(2.846)	(2.902)	(2.910)	(3.128)	(2.084)
speciesChinstrap		-206.510 ***	-87.634	-291.711 ***	-251.477 **	-282.539 **
		(57.731)	(46.347)	(81.502)	(81.079)	(88.790)
speciesGentoo		266.810 **	836.260 ***	707.028 ***	1014.627 ***	890.958 ***
		(95.264)	(85.185)	(94.359)	(129.561)	(144.563)
sexmale			530.381 ***	465.395 ***	389.892 ***	378.977 ***	538.080 ***
			(37.810)	(43.081)	(47.848)	(48.074)	(51.310)
bill_length_mm				21.633 **	18.204 *	18.964 **
				(7.148)	(7.106)	(7.112)
bill_depth_mm					67.218 ***	60.798 **	-86.947 ***
					(19.742)	(20.002)	(15.456)
islandDream						-21.180
						(58.390)
islandTorgersen						-58.777
						(60.852)
year						-42.785 *
						(20.949)
#observations	342	342	333	333	333	333	333
R squared	0.759	0.783	0.867	0.871	0.875	0.877	0.823
Adj. R Squared	0.758	0.781	0.865	0.869	0.873	0.873	0.821
Residual SE	394.278	375.535	295.565	291.955	287.338	286.524	340.427
* p < 0.001; p < 0.01; * p < 0.05.

The best model seems to be model 7, so we will use broom::tidy() and broom::glance() to get model results.

model7 %>% broom::tidy()

term	estimate	std.error	statistic	p.value
(Intercept)	-2.25e+03	625	-3.59	0.000376
flipper_length_mm	38.2	2.08	18.3	3.47e-52
sexmale	538	51.3	10.5	2.17e-22
bill_depth_mm	-86.9	15.5	-5.63	3.96e-08

model7 %>% broom::glance()

r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.823	0.821	340	509	2.9e-123	3	-2.41e+03	4.83e+03	4.85e+03	3.81e+07	329	333

Let us write down its equation

\[ \begin{aligned} \text{body_mass_g} &= \alpha + \beta_{1}(\text{flipper_length_mm}) + \beta_{2}(\text{sex}_{\text{male}})\ + \beta_{3}(\text{bill_depth_mm}) + \epsilon \end{aligned} \]\[ \begin{aligned} \text{body_mass_g} &= -2246.83 + 38.19(\text{flipper_length_mm}) + 538.08(\text{sex}_{\text{male}})\ - 86.95(\text{bill_depth_mm}) + \epsilon \end{aligned} \]

Fitting multiple regression models in one go

Let us recall the relationship between body mass and bill depth and have a look at the scatterplot.

We could run three separate regression, but we can estimate three regression models with a few lines of code.

penguins %>%
  na.omit() %>% 
  group_by(species) %>%
  summarise(
    broom::tidy(lm( body_mass_g ~ bill_depth_mm))
  )

## # A tibble: 6 x 6
## # Groups:   species [3]
##   species   term          estimate std.error statistic  p.value
##   <fct>     <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 Adelie    (Intercept)     -297.      469.    -0.634  5.27e- 1
## 2 Adelie    bill_depth_mm    218.       25.5    8.55   1.67e-14
## 3 Chinstrap (Intercept)      -36.2     613.    -0.0591 9.53e- 1
## 4 Chinstrap bill_depth_mm    205.       33.2    6.16   4.79e- 8
## 5 Gentoo    (Intercept)     -422.      488.    -0.864  3.89e- 1
## 6 Gentoo    bill_depth_mm    368.       32.5   11.3    1.64e-20

What we see is BLAH…

What if we add sex? First, let us facet_wrap() our scatter plot to see what it looks like

penguins %>%
  na.omit() %>% 
  group_by(species) %>%
  summarise(
    broom::tidy(lm( body_mass_g ~ bill_depth_mm + sex))
  )

## # A tibble: 9 x 6
## # Groups:   species [3]
##   species   term          estimate std.error statistic  p.value
##   <fct>     <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 Adelie    (Intercept)     1931.      452.      4.28  3.47e- 5
## 2 Adelie    bill_depth_mm     81.6      25.6     3.19  1.74e- 3
## 3 Adelie    sexmale          556.       62.1     8.96  1.63e-15
## 4 Chinstrap (Intercept)      830.      861.      0.964 3.39e- 1
## 5 Chinstrap bill_depth_mm    153.       48.9     3.14  2.55e- 3
## 6 Chinstrap sexmale          156.      110.      1.42  1.60e- 1
## 7 Gentoo    (Intercept)     2741.      579.      4.73  6.37e- 6
## 8 Gentoo    bill_depth_mm    136.       40.6     3.35  1.08e- 3
## 9 Gentoo    sexmale          604.       79.8     7.57  9.94e-12

Simpson’s paradox

Recall from our EDA, we saw no relationship between bill length and depth.

bill_no_species

If we fit a simple regression model, we get the following

simpsons_model <- lm( bill_depth_mm ~ bill_length_mm, 
                      data = penguins %>% na.omit())

simpsons_model %>% broom::tidy()

term	estimate	std.error	statistic	p.value
(Intercept)	20.8	0.854	24.3	1.03e-75
bill_length_mm	-0.0823	0.0193	-4.27	2.53e-05

simpsons_model %>% broom::glance()

r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.0523	0.0494	1.92	18.3	2.53e-05	1	-689	1.38e+03	1.39e+03	1.22e+03	331	333

The slope is significant, but the model r.squared (R2) explains only 5% of the overall variability.

However, when we plotted the same scatterplot colouring points by species, we got a completely different story.

bill_len_dep

We can again fit three individual models in one go

penguins %>%
  na.omit() %>% 
  group_by(species) %>%
  summarise(
    broom::tidy(lm( bill_depth_mm ~ bill_length_mm )),
    broom::glance(lm( bill_depth_mm ~ bill_length_mm ))
  )

## # A tibble: 6 x 16
## # Groups:   species [3]
##   species term  estimate std.error statistic  p.value r.squared adj.r.squared
##   <fct>   <chr>    <dbl>     <dbl>     <dbl>    <dbl>     <dbl>         <dbl>
## 1 Adelie  (Int~   11.5      1.37        25.2 1.51e- 6     0.149         0.143
## 2 Adelie  bill~    0.177    0.0352      25.2 1.51e- 6     0.149         0.143
## 3 Chinst~ (Int~    7.57     1.55        49.2 1.53e- 9     0.427         0.418
## 4 Chinst~ bill~    0.222    0.0317      49.2 1.53e- 9     0.427         0.418
## 5 Gentoo  (Int~    5.12     1.06        87.5 7.34e-16     0.428         0.423
## 6 Gentoo  bill~    0.208    0.0222      87.5 7.34e-16     0.428         0.423
## # ... with 8 more variables: sigma <dbl>, df <dbl>, logLik <dbl>, AIC <dbl>,
## #   BIC <dbl>, deviance <dbl>, df.residual <int>, nobs <int>

Alternatively, we can run a multiple regression model and the adjusted R2 = 76.5%

simpsons_model2 <- lm( bill_depth_mm ~ bill_length_mm + species, 
                      data = penguins %>% na.omit())

simpsons_model2 %>% broom::tidy()

term	estimate	std.error	statistic	p.value
(Intercept)	10.6	0.691	15.3	2.98e-40
bill_length_mm	0.2	0.0177	11.3	2.26e-25
speciesChinstrap	-1.93	0.226	-8.56	4.26e-16
speciesGentoo	-5.1	0.194	-26.3	1.04e-82

simpsons_model2 %>% broom::glance()

r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.767	0.765	0.954	362	8.88e-104	3	-455	920	939	300	329	333

Fit a model using lm(Y ~ X1 + X2 +..., data = dataframe)