Two-Way Analysis of Variance (ANOVA) is a technique for studying the
relationship between a quantitative dependent variable and two qualitative
independent variables. Usually we are interested in whether the level of
the dependent variable differs for different values of the qualitative
variables. We will use as an example data from a student project reported
in *Stats: Data and Models* (2nd ed.), by De Veaux, Velleman and
Bock, Addison-Wesley, 2008, Chapter 29, Exercise 11. The student was
interested in her success at basketball free throws. This study
investigated whether there was any relationship between the quantitative
variable "number of shots 'made' (i.e., successfully completed out of
50 tries)" and two qualitative variables "Time of Day" and "Shoes
Worn". ANOVA is commonly used with experimental studies and that is
the case here. You can find the data at our site as a plain text file and
as an Excel spreadsheet. Download the text file now and save it to the
directory where you installed R.

R can read data from a text file. The text file has to be in the form of a table with columns representing variables. All columns must be the same length. Missing data must be signified by "NA". Optionally, the first row of the file may contain names for the variables. To use the file you just downloaded in R you must define a variable to be equal to the contents of this file.

baskball <- read.table("baskball.txt",header=TRUE)

The argument `header=TRUE` tells R that the first row of the
file should be interpreted as variable names. (There must be a name for
every variable and the names must not have spaces in them). You can now
get a table of contents for what you have created in R with

> objects()

This should return `baskball` along with any other variables you
may have created. You will not see on this list any of the variables that
are inside of `baskball` because they are hiding. To see them, type

> names(baskball)
[1] "Time"  "Shoes" "Made"

To bring them out of hiding, you must attach them to your R workspace.

> attach(baskball)

Then you can work with them providing you remember that R is case-sensitive.
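If you would rather not use `attach` (it can cause confusion when several objects contain variables with the same names), you can reach the columns through the data frame itself. This is only a sketch; it assumes `baskball` was created with `read.table` as above.

```r
# Refer to a single column with the $ operator
baskball$Made

# Or evaluate a whole expression "inside" the data frame
with(baskball, mean(Made))
# [1] 31.1875
```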

| Time    | Shoes    | Made |
|---------|----------|------|
| Morning | Others   | 25   |
| Morning | Others   | 26   |
| Night   | Others   | 27   |
| Night   | Others   | 27   |
| Morning | Favorite | 32   |
| Morning | Favorite | 22   |
| Night   | Favorite | 30   |
| Night   | Favorite | 34   |
| Morning | Others   | 35   |
| Morning | Others   | 34   |
| Night   | Others   | 33   |
| Night   | Others   | 30   |
| Morning | Favorite | 33   |
| Morning | Favorite | 37   |
| Night   | Favorite | 36   |
| Night   | Favorite | 38   |
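If you were unable to download the file, the same data frame can be entered directly in R; the values below are copied from the table above, and you would then `attach` it as before.

```r
# Recreate the baskball data frame by hand from the table above
baskball <- data.frame(
  Time  = rep(c("Morning", "Morning", "Night", "Night"), times = 4),
  Shoes = rep(c("Others", "Favorite"), each = 4, times = 2),
  Made  = c(25, 26, 27, 27, 32, 22, 30, 34,
            35, 34, 33, 30, 33, 37, 36, 38)
)
```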

We can compare the two times or the two shoes by looking at summary
statistics or at parallel boxplots. To get the means for each level of
each factor, use R's `tapply` command. This takes three arguments:
the data you wish to summarize, the factor that determines the groups, and
the function you wish to apply to each of the groups.

> tapply(Made,Time,mean)
Morning   Night 
 30.500  31.875 
> tapply(Made,Shoes,mean)
Favorite   Others 
  32.750   29.625 
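`tapply` can also give the mean for every Time-by-Shoes combination if you pass a list of the two factors as its second argument; this assumes the variables are still attached as above.

```r
# Mean number of shots made in each Time x Shoes cell
tapply(Made, list(Time, Shoes), mean)
#         Favorite Others
# Morning    31.00  30.00
# Night      34.50  29.25
```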

Comparing the two sets of means, it looks like she does better at night and in her favorite shoes. But that could just be due to natural variability, which we can check with ANOVA. We prefer to start with a model that includes interaction. R is a bit roundabout here: we first run the ANOVA, store the results in a variable, and then generate a summary of those results.

> int <- aov(Made ~ Time*Shoes)
> summary(int)
            Df  Sum Sq Mean Sq F value Pr(>F)
Time         1   7.562   7.562  0.3441 0.5684
Shoes        1  39.062  39.062  1.7773 0.2072
Time:Shoes   1  18.062  18.062  0.8218 0.3825
Residuals   12 263.750  21.979

The *p*-value of 0.3825 for the interaction term suggests we do not
have to worry about interaction, so we repeat the process with a
simple additive model.

> noint <- aov(Made ~ Time + Shoes)
> summary(noint)
            Df  Sum Sq Mean Sq F value Pr(>F)
Time         1   7.562   7.562  0.3489 0.5649
Shoes        1  39.062  39.062  1.8020 0.2024
Residuals   13 281.812  21.678
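Because the additive model is nested inside the interaction model, R can also compare the two fits directly with `anova`; the resulting F test is the test of the interaction term. This sketch assumes the `int` and `noint` objects fitted above.

```r
# Compare the additive and interaction models;
# the F test here matches the Time:Shoes line of the first table
anova(noint, int)
```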

Unfortunately the *p*-values for both variables in both models are
quite large, suggesting that any effect we saw could well have been due to
chance. However, there is an alternative interpretation: with just 16
observations, we will only be able to detect a fairly large difference. It
appeared the shoes made a difference of about 3 successes in 50 tries. If
that is enough of a difference to matter in practice, we might repeat the
experiment with more trials. Before we do that, though, we might make some
displays to see whether data of this sort match the assumptions of ANOVA. No
sense gathering more data if they do not. ;-)

> boxplot(Made ~ Time)

> boxplot(Made ~ Shoes)

We can't expect perfection with only eight numbers in each group (how could eight numbers look bell-shaped?), but there are no signs here of serious skewness or outliers. All four groups have similar variabilities.
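To see all four Time-by-Shoes groups in a single display, the same formula notation used in the ANOVA also works in `boxplot`; this assumes the variables are still attached as above.

```r
# One box for each of the four Time x Shoes groups
boxplot(Made ~ Time * Shoes)
```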

If we see signs the assumptions are not met then the remedies are
similar to what they are in the univariate case. For example, outliers or
bimodality must be investigated as to their cause. A transformation of the
dependent variable may help just as it can in the univariate case.
However, it is most likely to be effective if *all* the groups are
skewed, and in the same direction, or if there is a systematic change in
variability as group means increase.
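Boxplots of the raw groups are one check; another common one is to look at the residuals from the fitted model. This is a sketch, assuming the `noint` object fitted above.

```r
# Normal quantile plot of the residuals:
# look for skewness and outliers
qqnorm(residuals(noint))
qqline(residuals(noint))

# Residuals against fitted values:
# look for variability that changes with the group means
plot(fitted(noint), residuals(noint))
```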

© 2007 Robert W. Hayden