What if you didn’t run a perfectly tight, controlled study and you’re worried there might be some covariate influencing your dataset? We can handle this quasi-experimental situation with an Analysis of Covariance (ANCOVA) model. While it doesn’t perfectly control for the confound, it’s a useful tool to mitigate the noise it introduces into your data.
In this example I’ll be using the dataset mtcars included in base R.
data(mtcars)
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
Let’s trim down the number of variables based on the question we’re going to ask. Let’s say we’re interested in whether cars with differing numbers of gears have significantly different weights, after removing the variance associated with miles per gallon (mpg).
# install.packages("dplyr")
library(dplyr)
mtcars = mtcars %>%
  dplyr::select(gear, mpg, wt)
mtcars$gear = factor(mtcars$gear, labels = c("3", "4", "5"))
It’s necessary to convert gear to a factor for Levene’s test and the multiple comparisons to work later.
ANCOVA
ANCOVAs are basically the love child of regressions and ANOVAs. They’re actually a type of multiple regression: we’re going to add a regression term to our ANOVA. As a reminder, an OLS regression looks like: $$Y_i = \beta_0 + \beta_1(X_i) + \epsilon_i$$
And a quick refresher on the ANOVA effects notation: $$Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$$
When we combine them, we get something that looks like: $$Y_{ij} = \mu + \alpha_i + \beta(x_{ij} - \overline{x}) + \epsilon_{ij} $$ Note this is the centered version of the equation.
- $\mu$ is the grand mean, a constant for all observations
- $\alpha_i$ is the treatment effect for group $i$, e.g. gear
- $\beta$ is the slope estimate for the covariate
- $\epsilon_{ij}$ is the error term
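For reference, the uncentered form simply drops the subtraction of the covariate mean: $$Y_{ij} = \mu + \alpha_i + \beta x_{ij} + \epsilon_{ij}$$ The only practical difference is that $\mu$ is then interpreted as an intercept rather than the grand mean; the group effects and significance tests come out the same.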
Assumptions
- Parallel slopes, i.e. no treatment-by-covariate interaction
- Homogeneity of variance across groups
- The covariate is unrelated to the independent variable
- Homoscedasticity, i.e. constant error variance around the regression slopes
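The third assumption, that the covariate is unrelated to the treatment factor, can be informally checked with a one-way ANOVA of the covariate on the treatment; a significant result is a warning sign that adjusting for the covariate will also remove treatment variance. A minimal sketch:

```r
data(mtcars)
mtcars$gear <- factor(mtcars$gear)

# Informal check: does the covariate (mpg) itself differ across gear groups?
# A significant F here warns that the covariate and treatment are related.
summary(aov(mpg ~ gear, data = mtcars))
```

As we'll see in the last section, mpg and gear are in fact related in this dataset, which is worth keeping in mind when interpreting the adjusted results.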
Homogeneity
To test for homogeneity we can use Levene’s test. I might do a future blog post on homogeneity and Levene’s test.
#install.packages("car")
library(car)
leveneTest(wt ~ gear, center = mean, data = mtcars)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 2 0.1444 0.8661
## 29
Because the test was not significant, we fail to reject the null hypothesis of equal variances, so the homogeneity assumption is reasonable.
Interaction Model
We want the regression lines to be parallel to each other. This suggests there isn’t an interaction between the confound and the treatment.
What is an interaction? Hmmm, good question. It happens, and it makes our story harder to explain. Basically, when two variables are examined for how they affect a third variable, an interaction means the first two affect the third in more than just an additive way. Think of the good ol’ days when your caffeine came straight in your Four Loko. The combined effect of caffeine and alcohol is more than just the sum of its parts. Some might say it’s gold.
Back to the problem at hand. Let’s test whether we have parallel lines. You must enter your covariate first, in this case mpg.
fit2 <- aov(wt ~ mpg * gear, data = mtcars)
summary(fit2)
## Df Sum Sq Mean Sq F value Pr(>F)
## mpg 1 22.343 22.343 109.975 7.81e-11 ***
## gear 2 1.117 0.558 2.748 0.0826 .
## mpg:gear 2 0.937 0.468 2.305 0.1198
## Residuals 26 5.282 0.203
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Weight is the y variable, and we’re checking whether there is an interaction (non-parallel lines) between gear and mpg in relation to weight. Since the mpg:gear term is not significant, we can conclude the parallel lines assumption holds. Personally, I’m wary of anything close to the significance cutoff, because it’s an arbitrary accepted practice in science. Regardless, for the purposes of this post, we’ll keep the assumption.
Additive Model
This is the model we can use when the parallel lines assumption holds, which in our case it does.
fit1 <- aov(wt ~ mpg + gear, data = mtcars)
summary(fit1)
## Df Sum Sq Mean Sq F value Pr(>F)
## mpg 1 22.343 22.343 100.598 9.01e-11 ***
## gear 2 1.117 0.558 2.514 0.0991 .
## Residuals 28 6.219 0.222
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note on the code here - you can also type aov(lm(wt ~ mpg + gear, data = mtcars)), which would also be correct. For conciseness and clarity, I code it as in the example above. Also, I’m using an uncentered ANCOVA in this example, which has a slightly different equation than the centered one listed above. It typically doesn’t change the conclusions you would draw from the model.
Significance test
How does the ANCOVA model determine significance? It uses an F test that looks pretty similar to the ANOVA’s, with some additions.
$$F = \frac{(\frac{SSE_R - SSE_F}{df_R - df_F})}{(\frac{SSE_F}{df_F})}$$
- $SSE_R$ is the sum of squares error for the reduced model.
- $SSE_F$ is the sum of squares error for the full model.
- $df_R$ is the error degrees of freedom for reduced model.
- $df_F$ is the error degrees of freedom for the full model.
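The gear row in the ANCOVA table above is exactly this statistic: the reduced model drops gear, the full model keeps it. A quick sketch of the same test using anova() on nested fits:

```r
data(mtcars)
mtcars$gear <- factor(mtcars$gear)

# Reduced model: covariate only. Full model: covariate + treatment.
reduced <- lm(wt ~ mpg, data = mtcars)
full    <- lm(wt ~ mpg + gear, data = mtcars)

# Partial F test comparing the nested models; this reproduces the gear
# row of the ANCOVA table (F = 2.514 on 2 and 28 df, p = 0.0991).
anova(reduced, full)
```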
Multiple Comparisons
To test multiple comparisons we’re going to need the multcomp library. The glht function lets us test our general linear hypotheses, and the mcp function specifies the linear hypothesis, here Tukey contrasts on gear.
#install.packages("multcomp")
library(multcomp)
posthoc <- glht(fit1, linfct = mcp(gear = "Tukey"))
summary(posthoc)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: aov(formula = wt ~ mpg + gear, data = mtcars)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 4 - 3 == 0 -0.1948 0.2405 -0.810 0.6998
## 5 - 3 == 0 -0.5834 0.2624 -2.224 0.0846 .
## 5 - 4 == 0 -0.3886 0.2576 -1.509 0.3019
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
To see the adjusted means by groups we’re going to use the effects library.
#install.packages("effects")
library(effects)
effect("gear", fit1)
##
## gear effect
## gear
## 3 4 5
## 3.381463 3.186661 2.798025
We can conclude that, after adjusting for mpg, there are no significant differences in weight among cars with different numbers of gears.
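Under the hood, these adjusted means are just the model’s predictions for each gear group with mpg held at its overall mean. A sketch of computing them directly with predict(), assuming the same model fit as above:

```r
data(mtcars)
mtcars$gear <- factor(mtcars$gear)
fit1 <- aov(wt ~ mpg + gear, data = mtcars)

# Adjusted means: predicted weight for each gear group, holding mpg at its
# overall mean. These should match the effect() output above.
newdat <- data.frame(gear = levels(mtcars$gear), mpg = mean(mtcars$mpg))
predict(fit1, newdata = newdat)
```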
Appropriate Use of ANCOVAs
You want to be careful when using ANCOVAs. They are not fit to answer some problems. For example, suppose we asked: does mpg differ among cars with different numbers of gears, after removing the variance associated with weight? This is an honest question to ask, but mpg is inherently associated with the weight of a car: cars that weigh more are going to get worse mpg. Some cars might have more fuel-efficient engines, but weight is so closely associated with mpg that removing its variance would most likely remove real differences from your outcome.
In the next post, I’ll discuss MANCOVAs.
References
http://users.stat.umn.edu/~helwig/notes/acov-Notes.pdf
http://www.stat.cmu.edu/~hseltman/309/Book/chapter10.pdf
https://web.ma.utexas.edu/users/mks/384Esp08/ANCOVA.pdf
https://pdfs.semanticscholar.org/e6d5/8a8a0dd660f6bbaba8d88137735d40124b05.pdf
http://faculty.missouri.edu/huangf/data/quantf/ancova_in_r_handout.pdf