Getting Started with R

Go to the R home page http://www.R-Project.org and download and install the software for your platform. Information on doing this can be found at the website.

For our first example, we will work with this data on annual rainfall in inches for various cities throughout the world.

		    Algiers   30    Lagos     72
		    Athens    16    La Paz    23
		    Beirut    35    Lima       2
		    Berlin    23    London    23
		    Bogota    42    Madrid    17
		    Bombay    71    Moscow    25
		    Cairo      1    Oslo      27
		    Dublin    30    Paris     22
		    Geneva    34    Rome      30
		    Havana    48    Vienna    26

We will enter the numbers using the R function "c". The standard layout for statistical data is that each column is a variable and each row is a case. Here the variable is rainfall and the cases are the cities. The "c" function combines its arguments into a single vector, but I sometimes like to think of its name as standing for "column": you can use it to enter a column of data, i.e., a single variable. Use it like this: at the R prompt ">" type

rainfall = c(30, 16, 35, 23,...,26)

Of course, you must type in the rest of the data where I typed "...". Hit RETURN at the end of the line. Nothing appears to happen: when an assignment succeeds, R is silent. To check that you succeeded, just type rainfall at the R prompt. R should tell you what is in the variable "rainfall".

> rainfall
 [1] 30 16 35 23 42 71  1 30 34 48 72 23  2 23 17 25 27 22 30 26

Ignore the [1] for now. There are many slick ways to get data into R but for now just typing it in will do. Another option is a simple data editor available in some versions of R. Type

data.entry(rainfall)

to see if your version includes this feature. If it does, this is a good time to fix any typos in your data entry. The following worked in the Windows version. Double click on a cell to edit it. Hit RETURN when done with that cell. When done with all cells, right click on the Data Editor and choose Close. The Data Editor only edits data already entered into R, but you can trick it into creating a new column of data. Let's say we also have snowfall data. Type

snowfall = c(1)

data.entry(snowfall)

The "1" in the first command is just a placeholder. Any number will do. When the Data Editor opens, replace the 1 with actual data.
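If your version of R has no Data Editor, the whole table can also be typed in one command. The following is only a sketch: the numbers are read down the left half of the table and then down the right half, and the `cities` and `labelled` names are introduced here for illustration and are not part of the original lesson.

```r
# Rainfall in inches, read down the left column of the table, then the right.
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

# Hypothetical helper vector: the city names in the same order as the numbers.
cities = c("Algiers", "Athens", "Beirut", "Berlin", "Bogota",
           "Bombay", "Cairo", "Dublin", "Geneva", "Havana",
           "Lagos", "La Paz", "Lima", "London", "Madrid",
           "Moscow", "Oslo", "Paris", "Rome", "Vienna")

# Work on a copy so that "rainfall" itself stays a plain vector of numbers.
labelled = rainfall
names(labelled) = cities

length(rainfall)      # 20 values, one per city
labelled["Bombay"]    # look a value up by city name
```

Keeping the names on a copy means the plain `rainfall` vector prints exactly as shown in the rest of this lesson.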

Once you have the data in R, you can get a variety of summary statistics and displays. Try some of the following.

> mean(rainfall)
[1] 29.85
> median(rainfall)
[1] 26.5

If you get different numbers, proofread your data for typos. If you do not have a Data Editor, you can fix one number at a time from the command line. Let's say that for the sixth city, Bombay, you typed 17 instead of 71.

rainfall[6]=71

will fix this.
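Here is that repair as a complete sketch. The data are deliberately entered with the slip described above, so the first mean comes out wrong.

```r
# Data as mistyped: 17 instead of 71 in the sixth position (Bombay).
rainfall = c(30, 16, 35, 23, 42, 17, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

mean(rainfall)     # not 29.85, so something was mistyped

rainfall[6]        # inspect the sixth entry
rainfall[6] = 71   # overwrite just that one entry

mean(rainfall)     # now 29.85
```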

> mode(rainfall)
[1] "numeric"

This is probably not what we had in mind for the mode. R is telling us that numerical data is what is stored in rainfall. If you really want the mode, the table command will work for data sets for which a mode is reasonable.

> table(rainfall)
rainfall
 1  2 16 17 22 23 25 26 27 30 34 35 42 48 71 72 
 1  1  1  1  1  3  1  1  1  3  1  1  1  1  1  1 

This tells us 23 and 30 are tied for mode with three occurrences each.
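If you would rather have R pick the mode out of the table for you, something like the following sketch should work. The variable names `counts` and `modes` are made up for this example.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

counts = table(rainfall)

# Keep every value whose count equals the largest count, so ties all show up.
# The names of a table are text, so convert them back to numbers.
modes = as.numeric(names(counts)[counts == max(counts)])
modes    # 23 30
```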

> sd(rainfall)
[1] 18.07084
> max(rainfall)
[1] 72
> min(rainfall)
[1] 1
> range(rainfall)
[1]  1 72

Oh, my. I work so hard to convince my students that the range is one number! Use diff on the output of the previous command.

> diff(range(rainfall))
[1] 71
> fivenum(rainfall)
[1]  1.0 22.5 26.5 34.5 72.0
> lentgth(rainfall)
Error: couldn't find function "lentgth"
> length(rainfall)
[1] 20

So, the range is 71. Many people include a sixth number in the five number summary: the number of observations, n. This is returned by the length function in R. (If you mistype something, R will give an error message. Most are much more cryptic than this one.) Different textbooks use different definitions of the second and fourth numbers in the five number summary. Tukey's original definition counts the median as part of both halves when n is odd; the Moore/QLP definition leaves it out of both. A simple test distinguishes between the two: for the numbers 1 through 5, Tukey's definition gives 2 and 4, while the other gives 1.5 and 4.5.

> fivenum(c(1,2,3,4,5))
[1] 1 2 3 4 5

Tukey wins!-)
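To see what the other answer would have looked like, here is a rough sketch of the competing rule, assuming it is the one that takes the medians of the lower and upper halves and leaves the overall median out of both when n is odd. The function `quartiles_excluding_median` is a hypothetical helper written for this example; it is not part of R.

```r
# Hypothetical helper: quartiles as medians of the two halves of the data,
# with the middle observation excluded from both halves when n is odd.
# Assumes at least two observations.
quartiles_excluding_median = function(x) {
  x = sort(x)
  n = length(x)
  half = floor(n / 2)
  lower = x[1:half]
  upper = x[(n - half + 1):n]
  c(median(lower), median(upper))
}

fivenum(1:5)[c(2, 4)]              # 2 4      (what R reports)
quartiles_excluding_median(1:5)    # 1.5 4.5  (the other convention)
```

For the rainfall data the two rules agree (22.5 and 34.5), because with an even number of observations there is no middle observation to argue about.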

> stem(rainfall)

  The decimal point is 1 digit(s) to the right of the |

  0 | 1267
  2 | 233356700045
  4 | 28
  6 | 12

I believe that every software lesson should include analysis of a dataset that shows how the software can actually be used to find out something useful about the data. Here we will also learn something useful about stem and leaf plots. They can be made on a variety of scales. The one chosen by R is a bit odd. If we type

> ?stem

we get help on the stem command. (How help appears depends on the platform. In CP/M, it arrives in the form of cavalry.) In this case, the help included this information:

     'stem' produces a stem-and-leaf plot of the values in 'x'. The
     parameter 'scale' can be used to expand the scale of the plot.  A
     value of 'scale=2' will cause the plot to be roughly twice as long
     as the default.

Let's try it!

> stem(rainfall, scale=0.5)

  The decimal point is 2 digit(s) to the right of the |

  0 | 00222222333333344
  0 | 577

> stem(rainfall, scale=2)

  The decimal point is 1 digit(s) to the right of the |

  0 | 12
  1 | 67
  2 | 2333567
  3 | 00045
  4 | 28
  5 | 
  6 | 
  7 | 12

> stem(rainfall, scale=5)

  The decimal point is 1 digit(s) to the right of the |

  0 | 12
  0 | 
  1 | 
  1 | 67
  2 | 2333
  2 | 567
  3 | 0004
  3 | 5
  4 | 2
  4 | 8
  5 | 
  5 | 
  6 | 
  6 | 
  7 | 12

> stem(rainfall, scale=10)

  The decimal point is at the |

   0 | 0
   2 | 0
   4 | 
   6 | 
   8 | 
  10 | 
  12 | 
  14 | 
  16 | 00
  18 | 
  20 | 
  22 | 0000
  24 | 0
  26 | 00
  28 | 
  30 | 000
  32 | 
  34 | 00
  36 | 
  38 | 
  40 | 
  42 | 0
  44 | 
  46 | 
  48 | 0
  50 | 
  52 | 
  54 | 
  56 | 
  58 | 
  60 | 
  62 | 
  64 | 
  66 | 
  68 | 
  70 | 0
  72 | 0

The stem and leaf with scale=0.5 is also odd. scale=2 is probably what you or your students would have done. It is better than the previous two scales because it reveals the two potential high outliers. Using scale=5 is even better because it shows the two potential low outliers. Are they really outliers? On my R system

> boxplot(rainfall)

caused a boxplot to pop up in a graphics window.

[Figure: boxplot of rainfall]
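By default, boxplot draws as separate points any values lying more than 1.5 times the length of the box beyond its ends. The arithmetic behind that rule can be sketched by hand; the names `hinges`, `spread` and `fences` are made up for this example.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

hinges = fivenum(rainfall)[c(2, 4)]    # ends of the box: 22.5 34.5
spread = diff(hinges)                  # length of the box: 12

# Anything outside these limits is plotted as an individual point.
fences = c(hinges[1] - 1.5 * spread, hinges[2] + 1.5 * spread)
fences                                 # 4.5 52.5

sort(rainfall[rainfall < fences[1] | rainfall > fences[2]])   # 1 2 71 72

# R's own bookkeeping for the same rule, for comparison.
sort(boxplot.stats(rainfall)$out)
```

The 1.5 multiplier is a convention rather than a verdict, so whether a flagged city is "really" an outlier still calls for judgment about the data.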

For data such as this, which appears skewed toward high values, a transformation is often appropriate.

> stem(log(rainfall))

  The decimal point is at the |

  0 | 07
  1 | 
  2 | 88
  3 | 11112334445679
  4 | 33

> stem(log(rainfall), scale=2)

  The decimal point is at the |

  0 | 0
  0 | 7
  1 | 
  1 | 
  2 | 
  2 | 88
  3 | 1111233444
  3 | 5679
  4 | 33  

On a logarithmic scale, only the low outliers appear!

Discussion Topic

If you were in international agribusiness, which rainfall would be more of an outlier for your purposes, one inch per year or 72 inches per year?

You can also get other displays from R. Here is something akin to what I would call a dotplot:

> stripchart(rainfall, method = "stack")

[Figure: stacked stripchart of rainfall]

> hist(rainfall)  

[Figure: histogram of rainfall]
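As with the stem and leaf, the scale of a histogram is a choice. The sketch below asks for ten-inch bins that line up with the stems of stem(rainfall, scale=2) and reports the counts without drawing anything; the breaks chosen here are an assumption made for the comparison, not R's defaults.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

# right = FALSE puts a value of 30 in the 30-40 bin, as a stem and leaf does.
# plot = FALSE returns the bin counts instead of drawing the histogram.
h = hist(rainfall, breaks = seq(0, 80, by = 10), right = FALSE, plot = FALSE)

h$counts    # 2 2 7 5 2 0 0 2
```

Dropping plot = FALSE draws the histogram with these same bins.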

Moral

If we were to examine this data by hand, we might be tempted to do a single display simply because making more is extra work. A stem and leaf is a good choice if we are doing things by hand. Were we to choose the same scale R did, we would not have learned much about this data. However, using the computer we were able to change the scale and select the best scale for our data. In the process we discovered two high outliers and then two low outliers. A boxplot, a logarithmic transformation, and thought about the meaning of the data all confirmed the significance of the low outliers.

Save Workspace Image?

When you leave R, you will be asked if you want to Save Workspace Image. You do! This will mean that all the data you worked hard to type in or import will be there waiting for you the next time you start R.
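You can also save a single variable yourself, without waiting for R to ask. This is a minimal sketch; the temporary file used here is just a stand-in for whatever file name and location you would actually choose.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

# Hypothetical location: a temporary file, so this sketch does not touch
# any real workspace file.
path = tempfile(fileext = ".RData")

save(rainfall, file = path)   # write just this one object to the file
rm(rainfall)                  # pretend we have started a fresh session
load(path)                    # brings "rainfall" back under the same name

rainfall
```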


© 2006 Robert W. Hayden