ANOVA vs regression
Thanks to this article, I started to get it. Here's my re-explanation.
ANOVA is one of many ways to carry out linear regression. It is in many ways a shorthand, and it is limited to categorical predictors.
Consider a linear regression with two predictors
Yi = β0 + β1 X1i + β2 X2i + εi
which in the R language corresponds to the function call lm(Y ~ X1 + X2). Sidenote: the full model hasn't been specified yet (for example, the distributional assumptions), so depending on your theory it might better correspond to glm(Y ~ X1 + X2) or some other function call.
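To make that concrete, here is a minimal sketch in R with simulated data; the coefficient values are made up purely for illustration.

```r
# A minimal sketch of the two-predictor regression above.
# The true coefficients are made up for illustration.
set.seed(1)
n  <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- 2 + 0.5 * X1 - 1.5 * X2 + rnorm(n)  # beta0 = 2, beta1 = 0.5, beta2 = -1.5

coef(lm(Y ~ X1 + X2))  # approximately recovers 2, 0.5, -1.5
```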
Now consider the ANOVA model with two predictors, written like this:
Yijk = μ + αj + βk + εijk
Hopefully you know regression well enough that I don't have to explain the earlier regression equation; the real secret is here, in the ANOVA notation. Key insights:
- X1 and X2 are present, just not written!
- μ is the grand mean.
- εijk is the residual for individual i who is in treatment j and group k.
- In preparation for ANOVA, you usually have to mutate X1 and X2 (change the dataset) because they must be effect-coded, while for regression they are often dummy-coded.
- Effect-coded means the categories are coded with 1’s and -1’s so that each category’s mean is compared to the grand mean.
- Dummy-coded means the categories are coded with 0’s and 1’s so that each category’s mean is compared to the reference group’s mean.
In dummy coding, if X1 is gender, a value of X1i = 0 might mean that individual i is female, and a value of 1 that they are male, so β1 becomes the mean effect of maleness as opposed to femaleness.
In effect coding, female would instead be coded as -1 and male as 1. Notice that there are now two coefficients αj, though we don't write α-1 and α1, but α0 and α1, or if you prefer, αf and αm.
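Here's a sketch of the two coding schemes in R, on a made-up gender factor (the data and names are hypothetical). R calls effect coding "sum" contrasts; note that R assigns +1 to the first level here, the reverse of the example above, but the logic is the same.

```r
# Hypothetical data: a two-level factor and a numeric outcome.
set.seed(1)
d <- data.frame(
  gender = factor(rep(c("female", "male"), each = 20)),
  y      = c(rnorm(20, mean = 10), rnorm(20, mean = 12))
)

# Dummy (treatment) coding, R's default:
# the intercept is the reference group's (female) mean,
# and "gendermale" is the male mean minus the female mean.
coef(lm(y ~ gender, data = d))

# Effect (sum) coding:
# the intercept is the grand mean,
# and "gender1" is the female mean's deviation from the grand mean.
coef(lm(y ~ gender, data = d, contrasts = list(gender = contr.sum)))
```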
Q: But what does that mean, what is α1 if not the mean effect of maleness as opposed to femaleness?
A: Well, the implication of leaving femaleness as a default state is that it gets baked into the regression intercept β0, which is not the case for the ANOVA grand mean μ. That is why β0 ≠ μ.
In a given instance of calculation, individual i is known to belong to either group j=0 or j=1, so for that individual we use one of the following two formulas. For clarity, we show the X variables here.
Yi0k = μ + α0 X1i + βk X2i + εi0k
Yi1k = μ + α1 X1i + βk X2i + εi1k
Pay attention! The 1 and 2 subscripts in X1 and X2 are variable labels; they are not the same as the 0's and 1's elsewhere (the group indices on α and Y, and the coded values the X's take). This may be why we generally don't like to write out X1 and X2; it gets too confusing. I think some textbooks use Roman numerals for the variable labels to keep them apart.
Now, dummy coding is a bit easier to grasp because you don't need to think of α as a vector {α0, α1}; the regression coefficient β1 is literally just a scalar, a single value. Going down the list of individuals from i=1 to i=n, the formula is the same every time. In carrying out ANOVA, the formula keeps changing, because there is not just a vector of individuals i ∈ {1, 2, 3, … n} to iterate through; there is also a vector of means αj ∈ {α0, α1} and another vector of means βk ∈ {β0, β1}.
It might not be clear what the benefit of designing the calculation this way is, when dummy coding seems just as good, but it becomes apparent when you realize you can grow these vectors to any size. Perhaps βk ∈ {β0, β1, β2, β3, β4}. That would be a bit inconvenient with dummy coding: you would need four new terms in the formula.
Of course computers can take care of that behind the scenes, so perhaps ANOVA is a relic from the pen-and-paper era.
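You can watch R do that expansion with model.matrix; this small sketch uses a made-up five-level factor.

```r
# A made-up five-level factor: R silently expands it into an intercept
# plus four dummy columns, the bookkeeping that ANOVA's vector notation hides.
f <- factor(c("b0", "b1", "b2", "b3", "b4"))
model.matrix(~ f)
```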
A gotcha: it's worth observing that in ANOVA, i does not go from i=1 to i=n as you are used to. Consider what I said before: "εijk is the residual for individual i who is in treatment j and group k". So does that mean there is a data row for individual i=40 in treatment 0 and group 0, and another data row showing that same individual i=40 in treatment 1 and group 0? Of course not. Instead, i is the index inside the cell. You might be person i=7 in treatment 0 and group 0, a different person from i=7 in treatment 1 and group 1. If there are two treatments and two groups, there are four combinations, and we split the overall index 1:N (we write capital N where ordinary regression is content with lowercase n) into four indexes, each 1:njk. In a balanced design each njk is simply N/4; otherwise the cell sizes njk differ, but they must of course add up to the overall sample size N.
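A quick sketch of that bookkeeping in R, with hypothetical treatment and group labels:

```r
# Two treatments x two groups = four cells; in a balanced design
# each cell holds N/4 individuals, indexed 1:(N/4) within the cell.
N <- 40
d <- expand.grid(i = 1:(N / 4), treatment = 0:1, group = 0:1)
table(d$treatment, d$group)   # four cells of 10 each; they sum to N = 40
```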
We're not done yet, but take a break! Stretch your legs.
Let's spit out some analyses! The example is taken from this article.
Here's a linear model which we will run as both ANOVA and ordinary regression:
Experience ~ Employment
The Employment variable is categorical, with categories "Clerical", "Custodial", and "Manager". The Experience variable is job experience in months (numeric, not categorical; the outcome is never categorical).
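The article ran this in SPSS; here is a hedged R sketch that simulates data with roughly the same group means, so the printouts below have something to be checked against. All numbers and names in the sketch are made up.

```r
# Simulated stand-in for the article's dataset (numbers are made up
# to roughly mimic its group means; the original analysis was in SPSS).
set.seed(42)
jobs <- data.frame(
  Employment = factor(rep(c("Clerical", "Custodial", "Manager"), each = 30)),
  Experience = c(rnorm(30, mean = 85,  sd = 20),
                 rnorm(30, mean = 298, sd = 20),
                 rnorm(30, mean = 78,  sd = 20))
)

# The article's printout treats Manager as the reference category;
# R defaults to the first level (Clerical), so we relevel to match.
jobs$Employment <- relevel(jobs$Employment, ref = "Manager")

fit_aov <- aov(Experience ~ Employment, data = jobs)
model.tables(fit_aov, type = "means")   # the three group means

fit_lm <- lm(Experience ~ Employment, data = jobs)
coef(fit_lm)                            # intercept + two dummy effects
```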
If we run this as an ANOVA model, we find that the means of the three groups are:
Clerical: 85.039
Custodial: 298.111
Manager: 77.619
If we run this as a regression, we find these coefficients:
Intercept: 77.619
Clerical: 7.420
Custodial: 220.492
What can you observe? Stop reading and try to answer, then read on.
- First, what do the numbers mean? They are job experience in months – as I said, but it's too easy to skim or forget that.
- The Intercept is identical to the Manager mean because Manager is the reference category here: with a single (categorical) predictor, dummy coding folds the reference group's entire mean into the intercept.
- The regression Intercept + Clerical is equal to ANOVA Clerical.
- The regression Intercept + Custodial is equal to ANOVA Custodial. (Both identities are checked in the sketch below.)
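Continuing the fit_lm sketch from above (simulated numbers, so only approximately equal to the article's):

```r
# The regression coefficients reassemble the ANOVA group means.
b <- coef(fit_lm)
b[["(Intercept)"]]                                # ~ Manager mean
b[["(Intercept)"]] + b[["EmploymentClerical"]]    # ~ ANOVA Clerical mean
b[["(Intercept)"]] + b[["EmploymentCustodial"]]   # ~ ANOVA Custodial mean
```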
By the way, what are the equivalents to μ etc? Once again,
ANOVA
Clerical: 85.039 (call this α1)
Custodial: 298.111 (call this α2)
Manager: 77.619 (call this α0)
Regression
Intercept: 77.619 (call this β0)
Clerical: 7.420 (call this β1)
Custodial: 220.492 (call this β2)
I noticed after the fact that there is no μ above; it seems to be an artifact of the printout (the article did it in SPSS), as the calculation should have had a nonzero μ. Still, try it yourself with any dataset.
In ANOVA, you test H0: μ1 = μ2 = … = μn.
If you are testing the means of two samples, a t-test could be the right statistical test. What if you have multiple groups? Instead of running multiple pairwise t-tests, you can use an ANOVA, which tests for equality of all the means in one shot.
Running multiple pairs of tests increases your chance of a type I error. For example, if you run 3 different hypothesis tests at a 95% confidence level each, your total confidence ends up being 0.857 (0.95 raised to the third power). Running an ANOVA maintains your desired confidence level.
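The arithmetic, as a one-liner:

```r
0.95^3       # 0.857375, the overall confidence after three tests
1 - 0.95^3   # 0.142625, the familywise chance of at least one false positive
```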