Two-Way Analysis of Variance with R

Two-Way Analysis of Variance (ANOVA) is a technique for studying the relationship between a quantitative dependent variable and two qualitative independent variables. Usually we are interested in whether the level of the dependent variable differs for different values of the qualitative variables. We will use as an example data from a student project reported in Stats: Data and Models (2nd ed.), by De Veaux, Velleman and Bock, Addison-Wesley, 2008, Chapter 29, Exercise 11. The student was interested in her success at basketball free throws. This study investigated whether there was any relationship between the quantitative variable "number of shots 'made' (i.e., successfully completed out of 50 tries)" and two qualitative variables "Time of Day" and "Shoes Worn". ANOVA is commonly used with experimental studies and that is the case here. You can find the data at our site as a plain text file and as an Excel spreadsheet. Download the text file now and save it to the directory where you installed R.

Reading Tables into R

R can read data from a text file. The text file has to be in the form of a table with columns representing variables. All columns must be the same length. Missing data must be signified by "NA". Optionally, the first row of the file may contain names for the variables. To use the file you just downloaded in R you must define a variable to be equal to the contents of this file.

baskball <- read.table("baskball.txt",header=TRUE)

The argument header=TRUE tells R that the first row of the file should be interpreted as variable names. (There must be a name for every variable and the names must not have spaces in them). You can now get a table of contents for what you have created in R with

> objects() 

This should return baskball along with any other variables you may have created. You will not see on this list any of the variables that are inside of baskball because they are hiding. To see them, type

> names(baskball)
[1] "Time"  "Shoes" "Made" 

To bring them out of hiding, you must attach them to your R workspace.

> attach(baskball)

Then you can work with them providing you remember that R is case-sensitive.

Time Shoes Made
Morning Others 25
Morning Others 26
Night Others 27
Night Others 27
Morning Favorite 32
Morning Favorite 22
Night Favorite 30
Night Favorite 34
Morning Others 35
Morning Others 34
Night Others 33
Night Others 30
Morning Favorite 33
Morning Favorite 37
Night Favorite 36
Night Favorite 38

We can compare the two times or the two shoes by looking at summary statistics or at parallel boxplots. To get the means for each level of each factor, use R's tapply command. This takes three arguments: the data you wish to summarize, the factor that determines the groups, and the function you wish to apply to each of the groups.

> tapply(Made,Time,mean)
Morning   Night 
 30.500  31.875 
> tapply(Made,Shoes,mean)
Favorite   Others 
  32.750   29.625 

Comparing the two sets of means, it looks like she does better at night and in her favorite shoes. But that could just be due to natural variability. We can check with ANOVA. We prefer to start with a model including interaction. R is a bit roundabout. We first run the ANOVA, store the results in a variable, and then generate a summary of those results.

> int <- aov(Made ~ Time*Shoes)
> summary(int)
            Df  Sum Sq Mean Sq F value Pr(>F)
Time         1   7.562   7.562  0.3441 0.5684
Shoes        1  39.062  39.062  1.7773 0.2072
Time:Shoes   1  18.062  18.062  0.8218 0.3825
Residuals   12 263.750  21.979

The p-value for the interaction term of 0.3825 suggests we do not have to worry about interaction so repeat the process but with a simple additive model.

> noint <- aov(Made~Time + Shoes)
> summary(noint)
            Df  Sum Sq Mean Sq F value Pr(>F)
Time         1   7.562   7.562  0.3489 0.5649
Shoes        1  39.062  39.062  1.8020 0.2024
Residuals   13 281.812  21.678 

Unfortunately the p-values for both variables in both models are quite large, suggesting that any effect we saw could well have been due to chance. However, there is an alternative interpretation: with just 16 observations, we will only be able to detect a fairly large difference. It appeared the shoes made a difference of about 3 successes in 50 tries. If that is enough of a difference to matter in practice, we might repeat the experiment with more trials. Before we do that, though, we might make some displays to see if data of this sort matches the assumptions of ANOVA. No sense gathering more if it does not;-)

> boxplot(Made ~ Time)
> boxplot(Made ~ Shoes)

We can't expect perfection with only eight numbers in each group (How could eight numbers look bell-shaped?) but there are no signs here of serious skewness or outliers. All four groups have similar variabilities.

If we see signs the assumptions are not met then the remedies are similar to what they are in the univariate case. For example, outliers or bimodality must be investigated as to their cause. A transformation of the dependent variable may help just as it can in the univariate case. However, it is most likely to be effective if all the groups are skewed, and in the same direction, or if there is a systematic change in variability as group means increase.

© 2007 Robert W. Hayden