One-Way Analysis of Variance with R

One-Way Analysis of Variance (ANOVA) is a technique for studying the relationship between a quantitative dependent variable and a single qualitative independent variable. Usually we are interested in whether the level of the dependent variable differs across the values of the qualitative variable. As an example, we will use real data from a study reported in 1935 by B. Lowe of the Iowa Agricultural Experiment Station.* Perhaps this originated at coffee break one morning. Donuts are traditionally a fried food and as such absorb some of the fat they are fried in. The amount and type of fat absorbed have implications for the healthfulness of the donuts. This study investigated whether there was any relationship between the quantitative variable "amount of fat absorbed" and the qualitative variable "type of fat". (Unfortunately we do not know just what the fats were. You could think of them as corn oil, soybean oil, lard, and Quaker State.)

You can find the data at our site as a plain text file and as an Excel spreadsheet. Download the text file now and save it in R's working directory (type getwd() at the R prompt to see which folder that is).

Reading Tables into R

R can read data from a text file. The text file has to be in the form of a table with columns representing variables. All columns must be the same length. By default, missing data must be signified by "NA". Optionally, the first row of the file may contain names for the variables. To use the file you just downloaded, assign its contents to a variable in R.

> donuts <- read.table("donuts.txt", header=TRUE)

The argument header=TRUE tells R that the first row of the file should be interpreted as variable names. You can now get a table of contents for what you have created in R with

> objects() 

This should return donuts along with any other variables you may have created. You will not see on this list any of the variables that are inside donuts because they are hiding. To see them, type

> names(donuts)
[1] "Fat1" "Fat2" "Fat3" "Fat4"

To bring them out of hiding, you must attach them to your R workspace.

> attach(donuts)

Then you can work with them, provided you remember that R is case-sensitive.

> Fat1
[1] 164 172 168 177 156 195
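
If the file is not at hand, the same table can be typed straight into R. The following is only a sketch, not part of the original materials: it assumes the text argument of read.table (available in current versions of R), and the name donut_text is made up for illustration. It also shows two ways to reach the columns without attach(), which some users prefer because attached names can collide with other objects in the workspace.

```r
# Same layout as donuts.txt: a header row, then one line per row of the table.
# donut_text is a hypothetical name used only for this illustration.
donut_text <- "Fat1 Fat2 Fat3 Fat4
164 178 175 155
172 191 193 166
168 197 178 149
177 182 171 164
156 185 163 170
195 177 176 168"

# text= reads from the string instead of from a file on disk
donuts <- read.table(text = donut_text, header = TRUE)

names(donuts)              # the four variable names
donuts$Fat1                # one column, reached without attach()
with(donuts, mean(Fat1))   # evaluate an expression inside the data frame
```

Both donuts$Fat1 and with() leave the workspace untouched, so nothing needs to be detached afterwards.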

Stacking Data

ANOVA is commonly used with experimental studies and that is the case here. The experiment consists of frying some donuts in each of four fats. Twenty-four batches of donuts were prepared and six were randomly assigned to each of the four fats. The results, in grams of fat absorbed per batch, are shown below as they might commonly be laid out on a page (and as they are laid out in the file):

Fat1 Fat2 Fat3 Fat4
164 178 175 155
172 191 193 166
168 197 178 149
177 182 171 164
156 185 163 170
195 177 176 168

While this is a reasonable arrangement for purposes of page layout, and very common for this type of data, it obscures the structure of the data and may confuse statistical software. The experimental units here are batches of donuts, and for each batch we write down two things: one value for the quantitative variable fat absorbed and one value for the qualitative variable type of fat. Here is how R can change the page layout format into one that is more logical and easier for statistical software to deal with, with one row per batch.

> sdonuts <- stack(donuts)
> sdonuts
   values  ind
1     164 Fat1
2     172 Fat1
3     168 Fat1
4     177 Fat1
5     156 Fat1
6     195 Fat1
7     178 Fat2
8     191 Fat2
9     197 Fat2
10    182 Fat2
11    185 Fat2
12    177 Fat2
13    175 Fat3
14    193 Fat3
15    178 Fat3
16    171 Fat3
17    163 Fat3
18    176 Fat3
19    155 Fat4
20    166 Fat4
21    149 Fat4
22    164 Fat4
23    170 Fat4
24    168 Fat4
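
To see what stack() is doing, the stacked table can also be built by hand. This is a sketch under the assumption that every column has the same length, as it does here; by_hand is a hypothetical name, and the data are typed in so the block runs on its own.

```r
# The donut data, typed in so the example is self-contained
donuts <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)

# One row per batch: the measurement, and the fat that batch was fried in.
# unlist() strings the columns end to end; rep() repeats each column name
# once for every row of that column.
by_hand <- data.frame(
  values = unlist(donuts, use.names = FALSE),
  ind    = factor(rep(names(donuts), each = nrow(donuts)))
)

sdonuts <- stack(donuts)

head(by_hand)                  # first six rows, all from Fat1
table(by_hand$ind)             # six batches for each fat
all.equal(by_hand, sdonuts)    # compare the hand-built table with stack()
```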

Simple Summaries

We can compare the four fats by looking at summary statistics or at parallel boxplots.

> summary(donuts)
      Fat1            Fat2            Fat3            Fat4      
 Min.   :156.0   Min.   :177.0   Min.   :163.0   Min.   :149.0  
 1st Qu.:165.0   1st Qu.:179.0   1st Qu.:172.0   1st Qu.:157.3  
 Median :170.0   Median :183.5   Median :175.5   Median :165.0  
 Mean   :172.0   Mean   :185.0   Mean   :176.0   Mean   :162.0  
 3rd Qu.:175.8   3rd Qu.:189.5   3rd Qu.:177.5   3rd Qu.:167.5  
 Max.   :195.0   Max.   :197.0   Max.   :193.0   Max.   :170.0  

> summary(sdonuts)
     values        ind   
 Min.   :149.0   Fat1:6  
 1st Qu.:165.5   Fat2:6  
 Median :173.5   Fat3:6  
 Mean   :173.8   Fat4:6  
 3rd Qu.:179.0           
 Max.   :197.0  

> attach(sdonuts)
> boxplot(values ~ ind)
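
The group means and standard deviations can also be read straight from the stacked data. The sketch below assumes only base R; it passes the data frame through the data argument rather than relying on attach(), and the axis labels are our own wording.

```r
# The donut data, typed in so the example is self-contained
donuts <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)
sdonuts <- stack(donuts)

# Apply a summary function to the values within each level of ind
group_means <- tapply(sdonuts$values, sdonuts$ind, mean)
group_sds   <- tapply(sdonuts$values, sdonuts$ind, sd)
group_means   # 172, 185, 176, 162 for Fat1 through Fat4
group_sds

# Parallel boxplots, naming the data frame explicitly instead of attaching it
boxplot(values ~ ind, data = sdonuts,
        xlab = "Type of fat", ylab = "Fat absorbed (grams)")
```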


It certainly looks like more of Fat 2 gets absorbed while Fat 4 seems least absorbed. But wait a minute! If we repeated the experiment we would most likely get different numbers. Could this change the rankings of the fats? Is it possible that all four fats are absorbed to about the same degree and we are just seeing random fluctuations from one assignment of batches to fats to another? To see if that is likely we do a hypothesis test. The null as usual is backwards: we hypothesize no difference among the fats. As always, the null provides a specific model with which we can play "what if". If the null were true, would such differences be ordinary or extraordinary?
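
The "what if" game can be acted out directly. Under the null hypothesis the fat labels are irrelevant, so we can shuffle the 24 measurements among the four fats many times and see how often chance alone spreads the group means as far apart as we observed. This is a rough re-randomization sketch, not the test used in this article; the function name spread_of_means, the seed, and the 2000 shuffles are arbitrary choices for illustration.

```r
# The donut data, typed in so the example is self-contained
donuts <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)
sdonuts <- stack(donuts)

# Hypothetical helper: distance between the largest and smallest group mean
spread_of_means <- function(values, groups) {
  diff(range(tapply(values, groups, mean)))
}

observed <- spread_of_means(sdonuts$values, sdonuts$ind)   # 185 - 162 = 23

set.seed(1)          # arbitrary seed, only so the run can be repeated
n_shuffles <- 2000   # arbitrary number of re-randomizations

# sample() with no other arguments returns the values in random order,
# which mimics a fresh random assignment of batches to fats
shuffled <- replicate(n_shuffles,
                      spread_of_means(sample(sdonuts$values), sdonuts$ind))

# Fraction of shuffles with a spread at least as large as the one observed
mean(shuffled >= observed)
```

A small fraction suggests that differences this large would be extraordinary if the fats were equivalent. The F test that follows addresses the same question with a formula instead of simulation.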

> oneway.test(values ~ ind, var.equal=TRUE)

        One-way analysis of means

data:  values and ind 
F = 5.4063, num df = 3, denom df = 20, p-value = 0.006876

(The argument var.equal=TRUE gives us the usual ANOVA test covered in textbooks.) The p-value of 0.006876 is for a test of the hypothesis that the mean amount of fat absorbed is the same for all four types of fat. Because it is so small, we reject the hypothesis of equal mean absorption.
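
The F statistic reported above can be reproduced from its definition: the variation between the group means divided by the variation within groups, each scaled by its degrees of freedom. The sketch below does the arithmetic in base R; variable names such as ss_between are ours.

```r
# The donut data, typed in so the example is self-contained
donuts <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)
sdonuts <- stack(donuts)

k <- nlevels(sdonuts$ind)   # number of fats (4)
n <- nrow(sdonuts)          # number of batches (24)

grand_mean  <- mean(sdonuts$values)
group_means <- tapply(sdonuts$values, sdonuts$ind, mean)
group_sizes <- tapply(sdonuts$values, sdonuts$ind, length)

# Between groups: how far each group mean sits from the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)   # 1636.5

# Within groups: how far each batch sits from its own group mean;
# ave() returns the group mean repeated for every row of that group
ss_within <- sum((sdonuts$values - ave(sdonuts$values, sdonuts$ind))^2)  # 2018

f_value <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_value <- pf(f_value, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

f_value   # matches F = 5.4063 with 3 and 20 degrees of freedom
p_value
```

summary(aov(values ~ ind, data = sdonuts)) prints the same quantities arranged as the traditional ANOVA table.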

Like any statistical test, this one is based on some assumptions. We will mention only the ones we can check with software. There are two: that the amounts for each fat are normally distributed and that they share a common variance. We can check these roughly from the boxplots, where we see roughly similar spreads and no serious departures from normality.
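
Beyond eyeballing the boxplots, R offers some more formal checks. The following sketch is one possible approach, not something the original analysis relied on: it compares the group standard deviations, draws a normal quantile plot of the residuals, and runs Bartlett's test for equal variances and the Shapiro-Wilk test for normality. With only six batches per fat these tests have little power, so treat them as rough guides.

```r
# The donut data, typed in so the example is self-contained
donuts <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)
sdonuts <- stack(donuts)

# Fit the one-way model; each residual is a batch minus its group mean
fit <- aov(values ~ ind, data = sdonuts)
res <- residuals(fit)

# Common variance: are the group standard deviations of similar size?
tapply(sdonuts$values, sdonuts$ind, sd)
equal_var <- bartlett.test(values ~ ind, data = sdonuts)
equal_var$p.value   # a small value would cast doubt on equal variances

# Normality: points near the line suggest roughly normal residuals
qqnorm(res)
qqline(res)
normality <- shapiro.test(res)
normality$p.value   # a small value would cast doubt on normality
```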

If we see signs that the assumptions are not met, the remedies are similar to those in the univariate case. For example, outliers or bimodality must be investigated as to their cause. A transformation of the dependent variable may help, just as it can in the univariate case. However, it is most likely to be effective if all the groups are skewed in the same direction, or if there is a systematic change in variability with amount of fat absorbed.
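
As an illustration only, here is how a transformation would be applied. The donut data show no obvious need for one, so the log transformation below is an assumption made purely to demonstrate the mechanics. The last call shows a different remedy for unequal spreads: with var.equal=FALSE (its default), oneway.test performs Welch's version of the test, which does not assume a common variance.

```r
# The donut data, typed in so the example is self-contained
donuts <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)
sdonuts <- stack(donuts)

# The transformation can be written directly in the model formula
log_test <- oneway.test(log(values) ~ ind, data = sdonuts, var.equal = TRUE)
log_test$p.value

# Welch's test on the original scale: no common-variance assumption
welch_test <- oneway.test(values ~ ind, data = sdonuts, var.equal = FALSE)
welch_test$p.value
```

If the conclusions agree across these versions, as a check like this is meant to reveal, the choice among them matters little for the data at hand.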

*Our source is Chapter 12 of Snedecor and Cochran, Statistical Methods (7th ed.), 1980, Iowa State University Press, Ames, IA.

© 2007 Robert W. Hayden