#statistics
Thanks to this article, I started to get it. Here's my re-explanation.
ANOVA is one of many ways to carry out linear regression. It is in many ways a shorthand, and it only works with categorical predictors.
Consider a linear regression with two predictors:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$
which in the R language may correspond to the function call `lm(Y ~ X1 + X2)`. Sidenote: the full model hasn't been specified yet, such as chosen assumptions, so it could better correspond to `glm(Y ~ X1 + X2)` or some other function invocation depending on your theory.
Now consider the ANOVA model with two predictors, written like this:
$$Y_{ijk} = \mu + \alpha_j + \beta_k + \varepsilon_{ijk}$$
Hopefully you know regression well enough that I don't have to explain the earlier regression equation; the real secret is here, in the ANOVA notation. Key insights:
- X1 and X2 are present, just not written!
- μ is the grand mean.
- εijk is the residual for individual i who is in treatment j and group k.
- In preparation for ANOVA, you usually have to mutate X1 and X2 (change the dataset) because they must be effect-coded, while for regression they are often dummy-coded.
- Effect-coded means that the categories are coded with 1s and -1s so that each category’s mean is compared to the grand mean.
- Dummy-coded means that the categories are coded with 0s and 1s so that each category’s mean is compared to the reference group’s mean.
In dummy coding, if X1 is gender, a value of X1i = 0 might mean that individual i is female, and a value of 1 that they are male, so β1 becomes the mean effect of maleness as opposed to femaleness.
In effect coding, female would instead be coded as -1 and male 1. Recognize now that there exist two coefficients αj, though we don't write α-1 and α1, but α0 and α1, or if you prefer, αf and αm.
Q: But what does that mean, what is α1 if not the mean effect of maleness as opposed to femaleness?
A: In dummy coding, femaleness is the default state, so the female group mean is baked into the regression intercept β0, and β1 measures how far the male mean sits from it. The ANOVA grand mean μ has no such default: it averages over both groups, so α1 is the mean effect of maleness as opposed to the grand mean, and α0 is the same thing for femaleness. That is how β0 ≠ μ.
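To see the difference in numbers, here is a minimal sketch in plain Python (not R, so it runs without any packages). The data are made up and balanced, and `fit_line` is a hypothetical helper of mine, not anything from the article. It fits the same six outcomes once with dummy coding and once with effect coding.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x with a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Made-up, balanced data: three females (mean 12), then three males (mean 22).
sex = ["f", "f", "f", "m", "m", "m"]
y = [10.0, 12.0, 14.0, 20.0, 22.0, 24.0]

dummy = [0 if s == "f" else 1 for s in sex]    # female is the reference group
effect = [-1 if s == "f" else 1 for s in sex]  # female = -1, male = 1

b0, b1 = fit_line(dummy, y)        # intercept = female mean, slope = male minus female
mu, alpha_m = fit_line(effect, y)  # intercept = grand mean, slope = male minus grand mean

print("dummy coding: ", b0, b1)
print("effect coding:", mu, alpha_m)
```

With these numbers the dummy-coded intercept lands on the female mean, while the effect-coded intercept lands on the grand mean, so the two intercepts differ even though both fits predict the same group means.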
In a given instance of calculation, individual i is known to belong to either group j=0 or j=1, so for that individual we employ either one of the following formulas. For clarity, we show the X variables here.
$$Y_{i0k} = \mu + \alpha_0 X_{1i} + \beta_k X_{2i} + \varepsilon_{i0k}$$
$$Y_{i1k} = \mu + \alpha_1 X_{1i} + \beta_k X_{2i} + \varepsilon_{i1k}$$
Pay attention! The 1 and 2 subscripts in X1 and X2 number the predictors; they are not the same as the 0's and 1's elsewhere, which number the categories. This may be why we generally don't like to write out X1 and X2: it gets too confusing. I think some textbooks use Roman numerals for this?
Now, dummy coding is a bit easier to grasp because you don't need to think of α as a vector {α0, α1}; the regression coefficient β1 is literally just a scalar, a single value. Going down the list of individuals from i=1 to i=n, the formula is the same every time. In carrying out ANOVA, the formula keeps changing because there is not just a vector of individuals i ∈ {1, 2, 3, … n} to iterate through, there is a vector of effects αj ∈ {α0, α1} and another vector of effects βk ∈ {β0, β1} (these ANOVA β's are not the regression coefficients β0 and β1 from earlier).
It might not be clear what the benefit is of designing the calculation this way, when dummy coding seems just as good, but it becomes apparent when you realize you can grow these vectors to any size. Perhaps βk ∈ {β0, β1, β2, β3, β4}. That would be a bit inconvenient to do with dummy coding – a five-level predictor needs four dummy variables, each with its own term in the formula.
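To make the bookkeeping concrete, here is a small sketch of dummy coding a five-level predictor. The level names and the `dummy_code` helper are hypothetical, chosen only for illustration.

```python
def dummy_code(value, levels, reference):
    """Return one 0/1 indicator per non-reference level."""
    return [1 if value == level else 0 for level in levels if level != reference]

levels = ["a", "b", "c", "d", "e"]  # hypothetical five-level factor
rows = {level: dummy_code(level, levels, reference="a") for level in levels}

for level, indicators in rows.items():
    print(level, indicators)
```

Every level turns into four indicator columns, and the reference level is the row of all zeros, which is why its mean ends up in the intercept.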
Of course computers can take care of that behind the scenes, so perhaps ANOVA is a relic from the pen-and-paper era.
A gotcha: in ANOVA, i does not go from i=1 to i=n as you are used to. Consider what I said before: "εijk is the residual for individual i who is in treatment j and group k". So does that mean there is a data row for individual i=40 in treatment 0 and group 0, and another data row showing that same individual i=40 in treatment 1 and group 0? Of course not. Instead, i refers to your index inside the cell. You might be person i=7 in treatment 0 and group 0, a different person from i=7 in treatment 1 and group 1.
If there are two treatments and two groups, there are four combinations (cells), and we split the overall index 1:N (we write capital N where ordinary regression is content with lowercase n) into four indexes, each 1:n. If the cells are equally large, n is simply N/4. The cells could also be of differing sizes, so that you have one size per cell, {n00, n01, n10, n11}, but they must of course add up to the overall sample size N.
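The re-indexing can be sketched like this. The rows are made up (seven people in an unbalanced two-by-two design); the point is only that i restarts at 1 inside every cell and that the cell sizes add up to N.

```python
from collections import defaultdict

# Hypothetical rows in overall order 1..N, each given as (treatment j, group k).
cells_in_order = [(0, 0), (0, 1), (1, 0), (0, 0), (1, 1), (1, 0), (0, 0)]

counts = defaultdict(int)
indexed = []
for overall, (j, k) in enumerate(cells_in_order, start=1):
    counts[(j, k)] += 1                              # next free index inside this cell
    indexed.append((overall, counts[(j, k)], j, k))  # (overall index, i, j, k)

N = len(cells_in_order)
cell_sizes = dict(counts)

for row in indexed:
    print(row)
print(cell_sizes, "sum =", sum(cell_sizes.values()), "N =", N)
```

Overall person 4 becomes i=2 in cell (0, 0), while overall person 6 is also i=2, but in cell (1, 0).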
We're not done yet, but take a break! Stretch your legs.
Let's spit out some analyses! Taken from this article.
Here's a linear model which we will run as both ANOVA and ordinary regression:
Experience ~ Employment
The Employment variable is categorical, the categories being either "Clerical", "Custodial" or "Manager". The Experience variable is job experience in months (numeric, not categorical – in these models the outcome is never categorical).
If we run this as an ANOVA model, we find that the means of the three groups are:
Clerical: 85.039
Custodial: 298.111
Manager: 77.619
If we run this as a regression, we find these coefficients:
Intercept: 77.619
Clerical: 7.420
Custodial: 220.492
What can you observe? Stop reading and try to answer, then read on.
- First, what do the numbers mean? They are job experience in months – as I said, but it's too easy to skim or forget that.
- The Intercept is identical to the Manager mean because Manager is the reference category under dummy coding, and since Employment is the only independent variable, the intercept is exactly that group's mean.
- The regression Intercept + Clerical is equal to ANOVA Clerical.
- The regression Intercept + Custodial is equal to ANOVA Custodial.
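To double-check the last two bullets, here is the arithmetic in Python. This is not the SPSS run from the article, just the printed numbers copied over; the variable names are mine.

```python
# Group means from the ANOVA printout.
anova_means = {"Clerical": 85.039, "Custodial": 298.111, "Manager": 77.619}

# Coefficients from the regression printout (Manager is the reference group).
intercept = 77.619
coefs = {"Clerical": 7.420, "Custodial": 220.492}

# Rebuild each group mean from the regression coefficients.
rebuilt = {"Manager": intercept}
for group, coef in coefs.items():
    rebuilt[group] = intercept + coef

for group, anova_mean in anova_means.items():
    print(group, round(rebuilt[group], 3), anova_mean)
```

The rebuilt values match the ANOVA means to the three decimals shown, so the two printouts describe the same fit in two different codings.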
By the way, what are the equivalents to μ etc? Once again,
ANOVA
Clerical: 85.039 (call this alpha1)
Custodial: 298.111 (call this alpha2)
Manager: 77.619 (call this alpha0)
Regression
Intercept: 77.619 (call this beta0)
Clerical: 7.420 (call this beta1)
Custodial: 220.492 (call this beta2)
I noticed after the fact that there is no μ above. It seems to be an artifact of the printout (the article did it in SPSS): the ANOVA numbers are group means, so each already has μ folded in (μ + αj) rather than being the αj on its own. Still, try it yourself with any dataset.
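If you want a separate μ and αj's, one way to get them (an assumption on my part, the printout doesn't show it) is to take μ as the plain average of the three group means, which is what effect coding gives. It equals the grand mean of the raw data only when the groups are equally large, and the group sizes aren't in the printout.

```python
from statistics import mean

# Group means copied from the ANOVA printout.
group_means = {"Clerical": 85.039, "Custodial": 298.111, "Manager": 77.619}

# Assumption: unweighted average of group means stands in for the grand mean.
mu = mean(group_means.values())

# Each effect is the group's deviation from that average.
effects = {group: m - mu for group, m in group_means.items()}

print("mu:", round(mu, 3))
for group, effect in effects.items():
    print(group, round(effect, 3))
print("effects sum to:", round(sum(effects.values()), 6))
```

By construction the effects sum to zero, and adding μ back to any of them returns the group mean from the printout.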
In ANOVA, you test H0: μ1 = μ2 = … = μk, where μ1 to μk are the population means of the k groups (not the grand mean μ from before).
If you are comparing the means of two samples, a t-test could be the right statistical test. What if you have more than two groups? Instead of running a t-test on every pair, you can use an ANOVA, which tests whether all the means are equal in one shot.
Running a test on every pair increases your chance of a type I error. For example, if you run 3 independent hypothesis tests at 95% confidence each, your total confidence ends up being 0.857 (0.95 raised to the third power), so the chance of at least one false positive is about 14% rather than 5%. Running a single ANOVA maintains your desired confidence level.
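The calculation generalizes to any number of groups. Here is a rough sketch that treats the pairwise tests as independent, which is a simplifying assumption (pairwise comparisons on the same data are not truly independent); `familywise_error` is a hypothetical helper name.

```python
from math import comb

def familywise_error(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

alpha = 0.05
for n_groups in (3, 4, 5):
    n_pairs = comb(n_groups, 2)  # number of pairwise t-tests needed
    print(n_groups, "groups:", n_pairs, "tests, family-wise error",
          round(familywise_error(alpha, n_pairs), 3))
```

Three groups need three pairwise tests, which reproduces the 0.857 total confidence; with five groups you are already at ten tests and the error rate climbs much higher.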