One-Way Analysis of Variance (ANOVA) is a technique for studying the relationship between a quantitative dependent variable and a single qualitative independent variable. Usually we are interested in whether the level of the dependent variable differs for different values of the qualitative variable.
We will use as an example real data from a study reported in 1935 by B. Lowe of the Iowa Agricultural Experiment Station.* Perhaps this originated at coffee break one morning. Donuts are traditionally a fried food and as such absorb some of the fat they are fried in. The amount and type of fat absorbed has implications for the healthfulness of the donuts. This study investigated whether there was any relationship between the quantitative variable "amount of fat absorbed" and a qualitative variable "type of fat". (Unfortunately we do not know just what the fats were. You could think of them as corn oil, soybean oil, lard, and Quaker State.)
You can find the data at our site as a plain text file and as an Excel spreadsheet. Download the text file now and save it to the directory where you installed R.
R can read data from a text file. The text file has to be in the form of a table with columns representing variables. All columns must be the same length. Missing data must be signified by "NA". Optionally, the first row of the file may contain names for the variables. To use the file you just downloaded in R you must define a variable to be equal to the contents of this file.
> donuts <- read.table("donuts.txt", header=TRUE)
The argument header=TRUE tells R that the first row of the file should be interpreted as variable names. You can now get a table of contents for what you have created in R with
> ls()
This should return donuts along with any other variables you may have created. You will not see on this list any of the variables that are inside of donuts because they are hiding. To see them, type
> names(donuts)
[1] "Fat1" "Fat2" "Fat3" "Fat4"
To bring them out of hiding, you must attach them to your R workspace.
> attach(donuts)
Then you can work with them providing you remember that R is case-sensitive.
> Fat1
[1] 164 172 168 177 156 195
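If you do not have the file at hand, the same data frame can be built from text typed directly into R. The sketch below is an illustration rather than part of the original instructions: it assumes the file holds the 24 values listed in this article, passes them through the text argument of read.table instead of a file name, and uses the $ operator as an alternative to attach.

```r
# Build the donuts data frame from inline text instead of a file.
# Assumption: these are the same 24 values that donuts.txt contains.
donuts <- read.table(header = TRUE, text = "
Fat1 Fat2 Fat3 Fat4
164  178  175  155
172  191  193  166
168  197  178  149
177  182  171  164
156  185  163  170
195  177  176  168
")

names(donuts)      # the four variable names
donuts$Fat1        # one column, reached without attaching the data frame
colMeans(donuts)   # mean grams of fat absorbed for each fat
```

Writing donuts$Fat1 is more typing than attach followed by Fat1, but it keeps it obvious which data frame a variable comes from once you have several in your workspace.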
ANOVA is commonly used with experimental studies and that is the case here. The experiment consists of frying some donuts in each of four fats. Twenty-four batches of donuts were prepared and six randomly assigned to each of the four fats. The results, in grams of fat absorbed for each batch, and as they might commonly be laid out on a page (and are laid out in the file) were:
Fat1 Fat2 Fat3 Fat4
 164  178  175  155
 172  191  193  166
 168  197  178  149
 177  182  171  164
 156  185  163  170
 195  177  176  168
While this is a reasonable arrangement for purposes of page layout, and very common for this type of data, it obscures the structure of the data and may confuse statistical software. The experimental units here are batches of donuts, and for each batch we write down two things: one value for the quantitative variable fat absorbed and one value for the qualitative variable type of fat. Here is how R can change the page layout format into a format that is more logical and easier for statistical software to deal with.
> sdonuts <- stack(donuts)
> sdonuts
   values  ind
1     164 Fat1
2     172 Fat1
3     168 Fat1
4     177 Fat1
5     156 Fat1
6     195 Fat1
7     178 Fat2
8     191 Fat2
9     197 Fat2
10    182 Fat2
11    185 Fat2
12    177 Fat2
13    175 Fat3
14    193 Fat3
15    178 Fat3
16    171 Fat3
17    163 Fat3
18    176 Fat3
19    155 Fat4
20    166 Fat4
21    149 Fat4
22    164 Fat4
23    170 Fat4
24    168 Fat4
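Once the data are stacked, each row is one batch, and R can work group by group using the ind column. The following sketch rebuilds the data frame from the values shown above so that it runs on its own, then checks the structure of the stacked version and computes the mean for each fat with tapply.

```r
# The four columns exactly as listed in the article
donuts <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)

# One row per batch: values is numeric, ind is a factor naming the fat
sdonuts <- stack(donuts)

str(sdonuts)                                # 24 observations of 2 variables
table(sdonuts$ind)                          # number of batches for each fat
tapply(sdonuts$values, sdonuts$ind, mean)   # mean fat absorbed for each fat
```

The stacked layout is what formula notation such as values ~ ind expects: a response on the left and a grouping variable on the right.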
We can compare the four fats by looking at summary statistics or at parallel boxplots.
> summary(donuts)
      Fat1            Fat2            Fat3            Fat4
 Min.   :156.0   Min.   :177.0   Min.   :163.0   Min.   :149.0
 1st Qu.:165.0   1st Qu.:179.0   1st Qu.:172.0   1st Qu.:157.3
 Median :170.0   Median :183.5   Median :175.5   Median :165.0
 Mean   :172.0   Mean   :185.0   Mean   :176.0   Mean   :162.0
 3rd Qu.:175.8   3rd Qu.:189.5   3rd Qu.:177.5   3rd Qu.:167.5
 Max.   :195.0   Max.   :197.0   Max.   :193.0   Max.   :170.0
> summary(sdonuts)
     values        ind
 Min.   :149.0   Fat1:6
 1st Qu.:165.5   Fat2:6
 Median :173.5   Fat3:6
 Mean   :173.8   Fat4:6
 3rd Qu.:179.0
 Max.   :197.0
> attach(sdonuts)
> boxplot(values ~ ind)
It certainly looks like more of Fat 2 gets absorbed while Fat 4 seems least absorbed. But wait a minute! If we repeated the experiment we would most likely get different numbers. Could this change the rankings of the fats? Is it possible that all four fats are absorbed to about the same degree and we are just seeing random fluctuations from one assignment of batches to fats to another? To see if that is likely we do a hypothesis test. The null as usual is backwards: we hypothesize no difference among the fats. As always, the null provides a specific model with which we can play "what if". If the null were true, would such differences be ordinary or extraordinary?
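One way to play "what if" quite literally is to reshuffle the 24 measurements among the four fats many times and see how often chance alone spreads the group means as far apart as they were in the real experiment. The sketch below is a randomization (permutation) test, which is a different technique from the F test this article uses; the seed, the number of reshuffles, and the helper name spread are arbitrary choices made for illustration.

```r
donuts <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)
sdonuts <- stack(donuts)

# Hypothetical helper: how spread out are the group means?
# (the variance of the four group means)
spread <- function(values, groups) var(tapply(values, groups, mean))

observed <- spread(sdonuts$values, sdonuts$ind)

set.seed(1)          # arbitrary seed, only so the run can be repeated
n_shuffles <- 5000   # arbitrary number of reshuffles

# Under the null, fat labels are irrelevant, so shuffle values across labels
shuffled <- replicate(n_shuffles,
                      spread(sample(sdonuts$values), sdonuts$ind))

# Proportion of reshuffles whose means are at least as spread out as observed
mean(shuffled >= observed)
```

A small proportion says the observed differences would be extraordinary if the fats were really interchangeable; a large one says they would be ordinary.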
> oneway.test(values ~ ind, var.equal=TRUE)
        One-way analysis of means
data:  values and ind
F = 5.4063, num df = 3, denom df = 20, p-value = 0.006876
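(The argument var.equal=TRUE gives us the usual ANOVA test covered in textbooks; without it, oneway.test does not assume equal variances and applies a Welch correction instead.) The p-value of 0.006876 is for a test of the hypothesis that the mean amount of fat absorbed is the same for all four types of fat. If that hypothesis were true, differences as large as the ones we observed would turn up in fewer than 1 experiment in 100. Because the p-value is so small, we reject the hypothesis of equal absorption.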
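To see where the F statistic comes from, it can be rebuilt from the between-group and within-group sums of squares. The sketch below is one way to do that arithmetic in R; the variable names are illustrative, and the final line uses aov, a built-in R function that this article does not otherwise use, as a cross-check.

```r
donuts <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)
sdonuts <- stack(donuts)
values <- sdonuts$values
ind <- sdonuts$ind

k <- nlevels(ind)        # number of groups
n <- length(values)      # number of batches

group_means <- tapply(values, ind, mean)
group_sizes <- tapply(values, ind, length)

# Between groups: how far each group mean lies from the overall mean
ss_between <- sum(group_sizes * (group_means - mean(values))^2)

# Within groups: how far each batch lies from the mean of its own group
ss_within <- sum((values - ave(values, ind))^2)

# F is the ratio of the two mean squares
f_stat <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_value <- pf(f_stat, k - 1, n - k, lower.tail = FALSE)

c(F = f_stat, p = p_value)

# Cross-check against R's built-in ANOVA table
summary(aov(values ~ ind))
```

The degrees of freedom k - 1 and n - k are the "num df" and "denom df" reported by oneway.test.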
Like any statistical test, this one is based on some assumptions. We will only mention the ones we can check with software. There are two: that the amounts absorbed for each fat are normally distributed, and that the four groups share a common variance. We can check these roughly from the boxplots. There we see roughly similar spreads and no serious departures from normality.
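The boxplots can be supplemented with a couple of quick numerical and graphical checks. The sketch below is one possible approach, not a procedure prescribed by this article: it compares the standard deviations of the four groups and draws a normal quantile plot of the residuals, where a residual is a batch's value minus the mean for its fat.

```r
donuts <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)
sdonuts <- stack(donuts)

# Common variance: the group standard deviations should be of similar size
group_sd <- tapply(sdonuts$values, sdonuts$ind, sd)
group_sd

# Residuals: each value minus the mean of its own group
res <- sdonuts$values - ave(sdonuts$values, sdonuts$ind)

# Normality: points close to the reference line suggest no serious departure
qqnorm(res)
qqline(res)
```

With only six batches per fat, these checks can reveal gross problems but cannot confirm that the assumptions hold exactly.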
If we see signs that the assumptions are not met, the remedies are similar to those in the univariate case. For example, the cause of any outliers or bimodality must be investigated. A transformation of the dependent variable may help just as it can in the univariate case. However, it is most likely to be effective if all the groups are skewed in the same direction, or if there is a systematic change in variability with amount of fat absorbed.
*Our source is Chapter 12 of Snedecor and Cochran, Statistical Methods (7th ed.), 1980, Iowa State University Press, Ames, IA.
© 2007 Robert W. Hayden