Go to the R home page http://www.R-Project.org and download and install the software for your platform. Information on doing this can be found at the website.
For our first example, we will work with this data on annual rainfall in inches for various cities throughout the world.
City      Inches    City      Inches
Algiers   30        Lagos     72
Athens    16        La Paz    23
Beirut    35        Lima       2
Berlin    23        London    23
Bogota    42        Madrid    17
Bombay    71        Moscow    25
Cairo      1        Oslo      27
Dublin    30        Paris     22
Geneva    34        Rome      30
Havana    48        Vienna    26
We will enter the numbers using the R function "c". The standard file format for statistical data is that each column is a variable and each row is a case. Here the variable is rainfall and the cases are the cities. Sometimes I like to think of the name of the "c" function as representing "column". You can use it to enter a column of data, i.e., a single variable. Use it like this: at the R prompt ">" type
rainfall = c(30, 16, 35,...,26)
Of course, you must type in the rest of the data where I typed "...", working down the first column of cities and then down the second. Hit RETURN at the end of the line. Nothing happens. If in doubt, R is silent. To check whether you succeeded, just type rainfall at the R prompt. R should tell you what is in the variable "rainfall".
> rainfall
[1] 30 16 35 23 42 71  1 30 34 48 72 23  2 23 17 25 27 22 30 26
Ignore the [1] for now. There are many slick ways to get data into R, but for now just typing it in will do. Another option is a simple data editor available in some versions of R. Type
data.entry(rainfall)
to see if your version includes this feature. If it does, this is a good time to fix any typos in your data entry. The following worked in the Windows version. Double click on a cell to edit it. Hit RETURN when done with that cell. When done with all cells, right click on the Data Editor and choose Close. The Data Editor only edits data already entered into R, but you can trick it into creating a new column of data. Let's say we also have snowfall data. Type
snowfall = c(1)
data.entry(snowfall)
The "1" in the first command is just a placeholder. Any number will do. When the Data Editor opens, replace the 1 with actual data.
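As a sketch of the complete entry step, the twenty rainfall values from the table can be typed in one command, going down the first column of cities and then down the second. The cities vector below is an addition for illustration and is not part of the original steps; it only shows one way to look up a value by city name.

```r
# Rainfall in inches, in table order: first column of cities top to bottom,
# then the second column.
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

# Illustrative addition: city names in the same order as the values.
cities = c("Algiers", "Athens", "Beirut", "Berlin", "Bogota",
           "Bombay", "Cairo", "Dublin", "Geneva", "Havana",
           "Lagos", "La Paz", "Lima", "London", "Madrid",
           "Moscow", "Oslo", "Paris", "Rome", "Vienna")

length(rainfall)               # number of values entered
rainfall[cities == "Bombay"]   # value stored for one city
```

Checking the length right after typing is a cheap way to catch a skipped or doubled number before computing anything.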
Once you have the data in R, you can get a variety of summary statistics and displays. Try some of the following.
> mean(rainfall)
[1] 29.85
> median(rainfall)
[1] 26.5
If you get different numbers, proofread your data for typos. If you do not have a Data Editor, you can fix one number at a time from the command line. Let's say that for the sixth city, Bombay, you typed 17 instead of 71.
rainfall[6] = 71
will fix this.
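Here is a small sketch of that repair from start to finish. The names correct and typed are made up for this illustration; typed is a copy of the data with the Bombay typo deliberately put in.

```r
# The data as it should be, and a copy containing the typo from the text
# (17 typed instead of 71 for the sixth city).
correct = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
            72, 23, 2, 23, 17, 25, 27, 22, 30, 26)
typed = correct
typed[6] = 17

mean(typed)      # does not match the mean shown above, so proofread
typed[6]         # inspect the sixth value
typed[6] = 71    # overwrite just that one element
mean(typed)      # now agrees with the mean of the correct data
```

The square brackets pick out one position in the vector, so an assignment to typed[6] leaves the other nineteen values alone.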
> mode(rainfall)
[1] "numeric"
This is probably not what we had in mind for the mode. R is telling us that numerical data is what is stored in rainfall. If you really want the mode, the table command will work for data sets for which a mode is reasonable.
> table(rainfall)
rainfall
 1  2 16 17 22 23 25 26 27 30 34 35 42 48 71 72
 1  1  1  1  1  3  1  1  1  3  1  1  1  1  1  1
This tells us 23 and 30 are tied for mode with three occurrences each.
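If you would rather have R pick the mode out of the table than read it off by eye, something like the following sketch should work. The names counts and modes are my own, and the approach is just one of several possibilities.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

counts = table(rainfall)

# Keep the value(s) whose count equals the largest count.
# Ties are all returned, so the result can have more than one entry.
modes = as.numeric(names(counts)[counts == max(counts)])
modes
```

The table stores the data values as names, which is why they are converted back to numbers with as.numeric.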
> sd(rainfall)
[1] 18.07084
> max(rainfall)
[1] 72
> min(rainfall)
[1] 1
> range(rainfall)
[1]  1 72
Oh, my. I work so hard to convince my students that the range is one number! Use diff on the output of the previous command.
> diff(range(rainfall))
[1] 71
> fivenum(rainfall)
[1]  1.0 22.5 26.5 34.5 72.0
> lentgth(rainfall)
Error: couldn't find function "lentgth"
> length(rainfall)
[1] 20
So, the range is 71. Many people include a sixth number in the five number summary: the number of observations, n. This is returned by the length function in R. (If you mistype something, R will give an error message. Most are much more cryptic than this one.) We know that different textbooks use different definitions of the second and fourth numbers in the five number summary. A simple test distinguishes between Tukey's original definition and the Moore/QLP definition:
> fivenum(c(1,2,3,4,5))
[1] 1 2 3 4 5
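The test works because the two definitions disagree on five evenly spaced values. Tukey's hinges count the median as part of each half when n is odd, which gives 2 and 4. I am assuming here that the Moore/QLP definition is the one that leaves the median out of both halves, which gives 1.5 and 4.5. On that assumption, the output above suggests fivenum follows Tukey. The helper half_medians below is not a built-in R function; it is a hypothetical one written only for this comparison.

```r
# Hypothetical helper: quartiles as the medians of the lower and upper
# halves of the sorted data, leaving the middle value out when n is odd.
half_medians = function(x) {
  x = sort(x)
  n = length(x)
  half = floor(n / 2)                 # size of each half
  c(median(x[1:half]), median(x[(n - half + 1):n]))
}

x = c(1, 2, 3, 4, 5)
fivenum(x)[c(2, 4)]   # second and fourth numbers as fivenum computes them
half_medians(x)       # quartiles with the middle value excluded
```

When n is even, there is no middle value to argue over, so for the twenty rainfall values both approaches give 22.5 and 34.5.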
> stem(rainfall)
  The decimal point is 1 digit(s) to the right of the |
  0 | 1267
  2 | 233356700045
  4 | 28
  6 | 12
I believe that every software lesson should include analysis of a dataset that shows how the software can actually be used to find out something useful about the data. Here we will also learn something useful about stem and leaf plots. They can be made on a variety of scales. The one chosen by R is a bit odd. If we type
help(stem)
we get help on the stem command. (How help appears depends on the platform. In CP/M, it arrives in the form of cavalry.) In this case, the help included this information:
'stem' produces a stem-and-leaf plot of the values in 'x'. The parameter 'scale' can be used to expand the scale of the plot. A value of 'scale=2' will cause the plot to be roughly twice as long as the default.
> stem(rainfall, scale=0.5)
  The decimal point is 2 digit(s) to the right of the |
  0 | 00222222333333344
  0 | 577

> stem(rainfall, scale=2)
  The decimal point is 1 digit(s) to the right of the |
  0 | 12
  1 | 67
  2 | 2333567
  3 | 00045
  4 | 28
  5 |
  6 |
  7 | 12

> stem(rainfall, scale=5)
  The decimal point is 1 digit(s) to the right of the |
  0 | 12
  0 |
  1 |
  1 | 67
  2 | 2333
  2 | 567
  3 | 0004
  3 | 5
  4 | 2
  4 | 8
  5 |
  5 |
  6 |
  6 |
  7 | 12

> stem(rainfall, scale=10)
  The decimal point is at the |
   0 | 0
   2 | 0
   4 |
   6 |
   8 |
  10 |
  12 |
  14 |
  16 | 00
  18 |
  20 |
  22 | 0000
  24 | 0
  26 | 00
  28 |
  30 | 000
  32 |
  34 | 00
  36 |
  38 |
  40 |
  42 | 0
  44 |
  46 |
  48 | 0
  50 |
  52 |
  54 |
  56 |
  58 |
  60 |
  62 |
  64 |
  66 |
  68 |
  70 | 0
  72 | 0
The stem and leaf with scale=0.5 is also odd. scale=2 is probably what you or your students would have done. It is better than the previous two scales because it reveals the two potential high outliers. Using scale=5 is even better because it shows the two potential low outliers. Are they really outliers? On my R system
> boxplot(rainfall)
caused a boxplot to pop up in a graphics window.
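If no graphics window is handy, the numbers behind the boxplot can be inspected directly. This is a sketch using boxplot.stats, which by default applies the usual rule of 1.5 times the distance between the hinges; the variable name b is my own.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

# boxplot.stats() computes what boxplot() would draw, without plotting.
b = boxplot.stats(rainfall)

b$stats       # lower whisker end, lower hinge, median, upper hinge, upper whisker end
sort(b$out)   # values lying beyond the whiskers, drawn as individual points
```

With the rainfall data, this rule lists all four extreme values: 1, 2, 71, and 72. Whether they deserve to be called outliers is still a judgment about the data, not something the rule decides.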
For data such as this, which appears skewed toward high values, a transformation is often appropriate.
> stem(log(rainfall))
  The decimal point is at the |
  0 | 07
  1 |
  2 | 88
  3 | 11112334445679
  4 | 33

> stem(log(rainfall), scale=2)
  The decimal point is at the |
  0 | 0
  0 | 7
  1 |
  1 |
  2 |
  2 | 88
  3 | 1111233444
  3 | 5679
  4 | 33
On a logarithmic scale, only the low outliers appear!
If you were in international agribusiness, which rainfall would be more of an outlier for your purposes, one inch per year or 72 inches per year?
You can also get other displays from R. Here is something akin to what I would call a dotplot:
> stripchart(rainfall, method = "stack")
If we were to examine this data by hand, we might be tempted to do a single display simply because making more is extra work. A stem and leaf is a good choice if we are doing things by hand. Were we to choose the same scale R did, we would not have learned much about this data. However, using the computer we were able to change the scale and select the best scale for our data. In the process we discovered two high outliers and then two low outliers. A boxplot, a logarithmic transformation, and thought about the meaning of the data all confirmed the significance of the low outliers.