Go to the R home page http://www.R-Project.org and download and install the software for your platform. Information on doing this can be found at the website.

For our first example, we will work with this data on annual rainfall in inches for various cities throughout the world.

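| City | Rainfall (in) | City | Rainfall (in) |
|---|---|---|---|
| Algiers | 30 | Lagos | 72 |
| Athens | 16 | La Paz | 23 |
| Beirut | 35 | Lima | 2 |
| Berlin | 23 | London | 23 |
| Bogota | 42 | Madrid | 17 |
| Bombay | 71 | Moscow | 25 |
| Cairo | 1 | Oslo | 27 |
| Dublin | 30 | Paris | 22 |
| Geneva | 34 | Rome | 30 |
| Havana | 48 | Vienna | 26 |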

We will enter the numbers using the R function "c". The
standard file format for statistical data is that each column is a
variable and each row is a case. Here the variable is rainfall and the
cases are the cities. Sometimes I like to think of the name of the "c"
function as representing "*c*olumn". You can use it to
enter a column of data, i.e., a single variable. Use it like this: at the
R prompt ">" type

`rainfall = c(30, 16, 35,...,26)`

Of course, you must type in the rest of the data where I typed "...", going from Algiers through Havana and then from Lagos through Vienna. Hit
RETURN at the end of the line. Nothing happens. If in doubt, R is silent.
To check to see if you succeeded, just type `rainfall` at the R
prompt. R should tell you what is in the variable "rainfall".

    > rainfall
     [1] 30 16 35 23 42 71  1 30 34 48 72 23  2 23 17 25 27 22 30 26
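For reference, the complete entry might look like the sketch below. The values are the ones listed above, Algiers through Havana followed by Lagos through Vienna. The parallel `city` vector is an optional extra that the rest of this lesson does not rely on.

```r
# Rainfall in inches, in the order used throughout this lesson.
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

# Optional: a parallel vector of city names (an illustrative extra).
city = c("Algiers", "Athens", "Beirut", "Berlin", "Bogota",
         "Bombay", "Cairo", "Dublin", "Geneva", "Havana",
         "Lagos", "La Paz", "Lima", "London", "Madrid",
         "Moscow", "Oslo", "Paris", "Rome", "Vienna")

length(rainfall)              # one value per city
rainfall[city == "Bombay"]    # look up one city's rainfall by name
```

Keeping the names in a separate vector leaves `rainfall` as a plain numeric vector, so it prints exactly as shown above.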

Ignore the [1] for now. There are many slick ways to get data into R but for now just typing it in will do. Another option is a simple data editor available in some versions of R. Type

`data.entry(rainfall)`

to see if your version includes this feature. If it does, this is a good time to edit any typos in your data entry. The following worked in the Windows version. Double click on a cell to edit it. Hit RETURN when done with that cell. When done with all cells, right click on the Data Editor and choose Close. The Data Editor only edits data already entered into R. You can trick it into creating a new column of data. Let's say we also have snowfall data. Type

`snowfall = c(1)`

`data.entry(snowfall)`

The "1" in the first command is just a placeholder. Any number will do. When the Data Editor opens, replace the 1 with actual data.

Once you have the data in R, you can get a variety of summary statistics and displays. Try some of the following.

    > mean(rainfall)
    [1] 29.85
    > median(rainfall)
    [1] 26.5

If you get different numbers, proofread your data for typos. If you do not have a Data Editor, you can fix one number at a time from the command line. Let's say that for the sixth city, Bombay, you typed 17 instead of 71.

`rainfall[6]=71`

will fix this.
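If you are not sure where a typo sits, one possible approach is to ask R for the position of the suspect value before overwriting it. The sketch below deliberately plants the error described above so that it can be found and fixed. `typed` is a hypothetical name used only for this illustration.

```r
# The data with the Bombay typo (17 instead of 71) planted in position 6.
typed = c(30, 16, 35, 23, 42, 17, 1, 30, 34, 48,
          72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

which(typed == 17)   # positions holding 17: the typo and Madrid's real value
typed[6] = 71        # overwrite only the sixth entry
mean(typed)          # compare with the mean shown earlier
```

Because 17 is also a legitimate value (Madrid), it is safer to fix a position than to replace every 17.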

    > mode(rainfall)
    [1] "numeric"

This is probably not what we had in mind for the mode. R is telling us
that numerical data is what is stored in `rainfall`. If you really
want the mode, the `table` command will work for data sets for
which a mode is reasonable.

    > table(rainfall)
    rainfall
     1  2 16 17 22 23 25 26 27 30 34 35 42 48 71 72
     1  1  1  1  1  3  1  1  1  3  1  1  1  1  1  1

This tells us 23 and 30 are tied for mode with three occurrences each.
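Rather than reading the table by eye, you could have R pick out the most frequent values. This is only a sketch, and the names `counts` and `modes` are hypothetical. It keeps every value whose count ties for the largest.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

counts = table(rainfall)

# The table's names are the data values (stored as text);
# keep those whose count equals the largest count.
modes = as.numeric(names(counts)[counts == max(counts)])
modes
```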

    > sd(rainfall)
    [1] 18.07084
    > max(rainfall)
    [1] 72
    > min(rainfall)
    [1] 1
    > range(rainfall)
    [1]  1 72

Oh, my. I work so hard to convince my students that the range is *one*
number! Use `diff` on the output of the previous command.

    > diff(range(rainfall))
    [1] 71
    > fivenum(rainfall)
    [1]  1.0 22.5 26.5 34.5 72.0
    > lentgth(rainfall)
    Error: couldn't find function "lentgth"
    > length(rainfall)
    [1] 20

So, the range is 71. Many people include a sixth number in the five
number summary: the number of observations, *n*. This is returned by
the `length` function in R. (If you mistype something, R will give
an error message. Most are much more cryptic than this one.) We know that
different textbooks use different definitions of the second and fourth
numbers in the five number summary. A simple test distinguishes between
Tukey's original definition and the Moore/QLP definition:

    > fivenum(c(1,2,3,4,5))
    [1] 1 2 3 4 5

Tukey wins!-)
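To see what the other answer would have looked like, here is a minimal sketch. It assumes the alternative rule takes the medians of the lower and upper halves of the sorted data, leaving the overall median out of both halves when *n* is odd. The function name `halves_summary` is hypothetical, and this illustrates that assumption rather than any textbook's official algorithm.

```r
halves_summary = function(x) {
  x = sort(x)
  n = length(x)
  half = floor(n / 2)            # size of each half; skips the median when n is odd
  lower = x[seq_len(half)]       # smallest `half` values
  upper = x[(n - half + 1):n]    # largest `half` values
  c(min(x), median(lower), median(x), median(upper), max(x))
}

halves_summary(c(1, 2, 3, 4, 5))   # quartiles fall between data values
fivenum(c(1, 2, 3, 4, 5))          # Tukey's hinges land on data values
```

With an even number of observations, such as the 20 rainfall values, the two rules give the same answer, which is why a five-value test case is needed to tell them apart.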

    > stem(rainfall)

      The decimal point is 1 digit(s) to the right of the |

      0 | 1267
      2 | 233356700045
      4 | 28
      6 | 12

I believe that every software lesson should include analysis of a dataset that shows how the software can actually be used to find out something useful about the data. Here we will also learn something useful about stem and leaf plots. They can be made on a variety of scales. The one chosen by R is a bit odd. If we type

> ?stem

we get help on the stem command. (How help appears depends on the platform. In CP/M, it arrives in the form of cavalry.) In this case, the help included this information:

'stem' produces a stem-and-leaf plot of the values in 'x'. The parameter 'scale' can be used to expand the scale of the plot. A value of 'scale=2' will cause the plot to be roughly twice as long as the default.

Let's try it!

    > stem(rainfall, scale=0.5)

      The decimal point is 2 digit(s) to the right of the |

      0 | 00222222333333344
      0 | 577

    > stem(rainfall, scale=2)

      The decimal point is 1 digit(s) to the right of the |

      0 | 12
      1 | 67
      2 | 2333567
      3 | 00045
      4 | 28
      5 |
      6 |
      7 | 12

    > stem(rainfall, scale=5)

      The decimal point is 1 digit(s) to the right of the |

      0 | 12
      0 |
      1 |
      1 | 67
      2 | 2333
      2 | 567
      3 | 0004
      3 | 5
      4 | 2
      4 | 8
      5 |
      5 |
      6 |
      6 |
      7 | 12

    > stem(rainfall, scale=10)

      The decimal point is at the |

       0 | 0
       2 | 0
       4 |
       6 |
       8 |
      10 |
      12 |
      14 |
      16 | 00
      18 |
      20 |
      22 | 0000
      24 | 0
      26 | 00
      28 |
      30 | 000
      32 |
      34 | 00
      36 |
      38 |
      40 |
      42 | 0
      44 |
      46 |
      48 | 0
      50 |
      52 |
      54 |
      56 |
      58 |
      60 |
      62 |
      64 |
      66 |
      68 |
      70 | 0
      72 | 0

The stem and leaf with `scale=0.5` is also odd. `scale=2`
is probably what you or your students would have done. It is better than
the previous two scales because it reveals the two potential high
outliers. Using `scale=5` is even better because it also shows the two
potential *low* outliers. Are they really outliers? On my R system

> boxplot(rainfall)

caused a boxplot to pop up in a graphics window.
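If you would rather see numbers than a picture, `boxplot.stats` reports the values a boxplot would draw as separate points. The sketch below assumes the default rule, which flags anything more than 1.5 times the spread between the hinges beyond either hinge. `bs` is a hypothetical variable name.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

bs = boxplot.stats(rainfall)

bs$stats       # lower whisker end, lower hinge, median, upper hinge, upper whisker end
sort(bs$out)   # values the boxplot would plot as individual points
```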

For data such as this, which appears skewed toward high values, a transformation is often appropriate.

    > stem(log(rainfall))

      The decimal point is at the |

      0 | 07
      1 |
      2 | 88
      3 | 11112334445679
      4 | 33

    > stem(log(rainfall), scale=2)

      The decimal point is at the |

      0 | 0
      0 | 7
      1 |
      1 |
      2 |
      2 | 88
      3 | 1111233444
      3 | 5679
      4 | 33

On a logarithmic scale, only the low outliers appear!

If you were in international agribusiness, which rainfall would be more of an outlier for your purposes, one inch per year or 72 inches per year?

You can also get other displays from R. Here is something akin to what I would call a dotplot:

> stripchart(rainfall, method = "stack")

> hist(rainfall)
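Histograms raise the same scale question as stem and leaf plots, because the picture depends on the bin width. As a hedged sketch, `breaks` can request ten-inch bins (the bin edges are chosen only for illustration), and `plot = FALSE` lets you inspect the counts without drawing anything.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

bins = seq(0, 80, by = 10)       # bin edges at 0, 10, ..., 80

h = hist(rainfall, breaks = bins, plot = FALSE)
h$counts                         # number of cities in each bin

hist(rainfall, breaks = bins)    # draw the histogram with those bins
```

By default each bin includes its right edge, so a value of 30 is counted in the 20 to 30 bin.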

If we were to examine this data by hand, we might be tempted to do a single display simply because making more is extra work. A stem and leaf is a good choice if we are doing things by hand. Were we to choose the same scale R did, we would not have learned much about this data. However, using the computer we were able to change the scale and select the best scale for our data. In the process we discovered two high outliers and then two low outliers. A boxplot, a logarithmic transformation, and thought about the meaning of the data all confirmed the significance of the low outliers.

When you leave R, you will be asked if you want to `Save Workspace Image`.
*You do!* This will mean that all the data you worked
hard to type in or import will be there waiting for you the next time you
start R.
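Saving the whole workspace is not the only option. As a sketch, `save` and `load` can store a single object in a file of your choosing. The file name below is hypothetical, and the temporary directory is used only so the example leaves nothing behind. For real work you would pick a permanent location.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

# Hypothetical file location; substitute a folder you actually keep.
path = file.path(tempdir(), "rainfall.RData")

save(rainfall, file = path)   # write the object to disk
rm(rainfall)                  # simulate starting a fresh session
load(path)                    # restores the object under its original name

mean(rainfall)                # the data are back
```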

© 2006 Robert W. Hayden