Running R: Quantitative Data

This page uses the PULSE data. If you are not already familiar with it, you should read a description of the data. You can find the entire dataset at our site as a plain text file or as an Excel spreadsheet. Download the text version and save it in R's working directory; you can type getwd() at the R prompt to see where that is. Then type

> PULSE = read.table("pulse.txt",header=TRUE)

at the R prompt. The argument header=TRUE tells R that the first row of the file should be interpreted as variable names. (These should not include spaces.) You can now get a table of contents for what you have created in R with

> objects() 

This should return PULSE along with any other objects you may have created. You will not see on this list any of the variables that are inside of PULSE because they are hiding. To see them, type

> names(PULSE)
[1] "PuBefor"   "PuAfter"   "Ran."      "Smokes."   "Sex"       "Height"   
[7] "Weight"    "ActivityL"

To bring them out of hiding, you must "attach" them to your R workspace. (This avoids conflicts if several tables include variables with the same name. Attach just one at a time.)
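When you are done with a table, you can detach it so its variables no longer shadow others with the same names. A minimal sketch with a toy data frame (not the PULSE data):

```r
# Toy data frame standing in for PULSE
d <- data.frame(x = c(1, 2, 3))
attach(d)
mean(x)     # x is visible directly while d is attached
detach(d)   # remove d's variables from the search path again
```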

> attach(PULSE)

You can get an assortment of summary statistics for all your variables by using the summary command.

> summary(PULSE)
    PuBefor          PuAfter     Ran.    Smokes.      Sex         Height     
 Min.   : 48.00   Min.   : 50   no :57   no :64   female:35   Min.   :61.00  
 1st Qu.: 64.00   1st Qu.: 68   yes:35   yes:28   male  :57   1st Qu.:66.00  
 Median : 71.00   Median : 76                                 Median :69.00  
 Mean   : 72.87   Mean   : 80                                 Mean   :68.72  
 3rd Qu.: 80.00   3rd Qu.: 85                                 3rd Qu.:72.00  
 Max.   :100.00   Max.   :140                                 Max.   :75.00  
     Weight        ActivityL    
 Min.   : 95.0   Min.   :0.000  
 1st Qu.:125.0   1st Qu.:2.000  
 Median :145.0   Median :2.000  
 Mean   :145.2   Mean   :2.109  
 3rd Qu.:155.5   3rd Qu.:2.000  
 Max.   :215.0   Max.   :3.000 

Note that R gives different summaries depending on whether a column contains numbers or text. This seems right, except possibly for the ActivityL variable, which is an ordered category and so might better be treated as qualitative data. For now, just remember that anything in a data file that looks like a number to R will be treated as one, and you may need to take special steps if you have a column of numbers that are actually labels -- medical diagnosis codes, for example.
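If a column of codes has been read in as numbers, converting it to a factor makes R treat it as categorical. A small sketch with made-up codes (these values are hypothetical, not part of PULSE):

```r
# Hypothetical numeric codes that are really labels
codes <- c(250, 401, 250, 530, 401)
codes <- factor(codes)
summary(codes)   # now a count for each code, not a mean
```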

The one summary you might miss from the above is the standard deviation. You can easily get that for the variable of your choice.

> sd(PuAfter)
[1] 17.09379

We can compare the before and after pulse rates with parallel boxplots. R wants to see the data in a format that has all the measurements (both before and after) in one column with another column labelling the two sets of data. The R function c (for concatenate) combines a bunch of things into a single thing.

> rate = c(PuBefor,PuAfter)
> rate
  [1]  48  54  54  58  58  58  60  60  60  60  61  62  62  62  62  62  62  62
 [19]  62  62  64  64  64  64  66  66  66  66  66  68  68  68  68  68  68  68
 [37]  68  68  68  68  70  70  70  70  70  70  72  72  72  72  72  72  74  74
 [55]  74  74  74  76  76  76  76  76  78  78  78  78  78  80  80  80  82  82
 [73]  82  84  84  84  84  86  87  88  88  88  90  90  90  90  92  92  94  96
 [91]  96 100  54  56  50  70  58  56  76  62  70  66  70  76  75  58 100  98
[109]  62  66  68  66  88  80  62  60  78  82 102  72  76  72  76  76 112  66
[127]  68  64  68  66  68  68  72 106  94  62  70  66  80  74  74  70  70  68
[145]  84  76  70  74  76 118  76  76  76  76 104 118  76  78  80  96 128  74
[163] 100  84  80  84  84  84  80  84  84 110  84  74  94  88  90  92  84  94
[181]  92 140 116 115

We used the name rate for the stacked variable. We also need a variable to keep track of which group each measurement came from. The R rep (for repeat) function is useful for generating repetitive data.

> B = rep("Before",92)
> A = rep("After",92)
> BA = c(B,A)
> BA
  [1] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
  [9] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [17] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [25] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [33] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [41] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [49] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [57] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [65] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [73] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [81] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [89] "Before" "Before" "Before" "Before" "After"  "After"  "After"  "After" 
 [97] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After" 
[105] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After" 
[113] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After" 
[121] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After" 
[129] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After" 
[137] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After" 
[145] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After" 
[153] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After" 
[161] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After" 
[169] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After" 
[177] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After" 
>

This creates a variable B that is just 92 instances of the word "Before" and similarly for A. Then the two are concatenated into BA, the categorical variable that keeps track of before and after. Now you can type
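The same labels can also be built in a single step with rep's each argument; a sketch:

```r
# 92 copies of "Before" followed by 92 copies of "After"
BA <- rep(c("Before", "After"), each = 92)
length(BA)   # 184
table(BA)    # 92 of each
```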

> boxplot(rate ~ BA)

to get boxplots.

[Figure: side-by-side boxplots of pulse rate, Before vs. After]

The tilde "~" is used often in R. Here you can think of it as saying we are going to see how rate depends on BA. You can see that the "after" data show a higher median and more variability, along with high outliers not present in the "before" data. The fact that both center and variability have changed makes it hard to give a simple comparison here. If we looked only at the mean, we might say that pulse rates went up by about seven points, and on average they did. However, the lowest rates have hardly gone up at all (the minimum rose by 2) while the highest changed considerably (the maximum rose by 40!). A simple average change does not describe what happened here very accurately, which is one reason we always need to make a picture!
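A numeric companion to the boxplots is tapply, which applies a summary function within each group. A sketch with made-up numbers, since it assumes the rate and BA variables built above:

```r
# Toy stand-ins for the stacked pulse data
rate <- c(60, 70, 80, 75, 95, 120)
BA   <- rep(c("Before", "After"), each = 3)
tapply(rate, BA, median)   # median within each group
tapply(rate, BA, sd)       # and the spread
```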

Because the class contained both men and women, we might expect some bimodality in the heights and weights. Stem and leaf plots are R's default tool for assessing the shape of a distribution. (Boxplots tend to hide bimodality.)

> stem(Height)

  The decimal point is at the |

  60 | 08
  62 | 000080000
  64 | 0000005
  66 | 000000000000000
  68 | 000000000000000000005
  70 | 0000000000005
  72 | 00000000000000055
  74 | 00000000

> stem(Weight)

  The decimal point is 1 digit(s) to the right of the |

   8 | 5
  10 | 288002556688
  12 | 000123555550000013555688
  14 | 000025555580000000000355555555557
  16 | 000045000055
  18 | 000500005
  20 | 5

Are the heights trimodal, or is that just random scatter? The weights look unimodal, but looking at our data on a single scale is rarely enough.

> stem(Weight, scale=2)

  The decimal point is 1 digit(s) to the right of the |

   9 | 5
  10 | 288
  11 | 002556688
  12 | 00012355555
  13 | 0000013555688
  14 | 00002555558
  15 | 0000000000355555555557
  16 | 000045
  17 | 000055
  18 | 0005
  19 | 00005
  20 | 
  21 | 5
      

Now make boxplots as you did above but with Sex as the factor and Height as the measurement.

> boxplot(Height ~ Sex)
      
[Figure: side-by-side boxplots of Height by Sex]

Both sexes show reasonably compact, symmetric distributions with no outliers but the men are consistently about 5 inches taller than the women. Because the two distributions have similar shapes and variabilities, we can reasonably say that the men as a group are about 5 inches taller than the women. Compare this to the situation with the pulse rates above, where such a simple description was an oversimplification. Note that although the two sexes are clearly different here, there is enough overlap that the stem and leaf does not clearly show the two groups.

We saw in the earlier boxplots of the pulse rates that the after rates were skewed toward high values with several possible outliers. When we see most of the alleged outliers on one side of a boxplot, it may well be that we have a skewness problem rather than an outlier problem. Another sign is outliers that start close to the whiskers and gradually thin out. Another is a boxplot that looks skewed in the direction of the outliers even if we erase them.

It is important to recognize these two different situations because the remedies are quite different. If we have an outlier problem, we need to investigate the individual points to see if we can find a reason for their unusual behavior. If we have a skewness problem, we might use the median rather than the mean to describe the center, or we might re-express (transform) the data to make it more symmetric. Here are a number of common transformations and their effect on the after pulse readings. First, here is a stem and leaf of the original data.

    5* | 04
    5. | 6688
    6* | 022224
    6. | 666666888888
    7* | 000000022244444
    7. | 566666666666688
    8* | 000002444444444
    8. | 88
    9* | 022444
    9. | 68
   10* | 0024
   10. | 6
   11* | 02
   11. | 5688
   12* | 
   12. | 8
   13* | 
   13. | 
   14* | 0

Let's see if a square root transformation makes this less skewed.

> SquareRoot = sqrt(PuAfter)
> stem(SquareRoot)

  The decimal point is at the |

   7 | 13
   7 | 556679999
   8 | 01111112222224444444
   8 | 5556666677777777777778899999
   9 | 122222222244
   9 | 56677789
  10 | 00123
  10 | 567899
  11 | 3
  11 | 8

This is better. Repeat with natural logarithms (base e).

> ln = log(PuAfter)
> stem(ln)

  The decimal point is 1 digit(s) to the left of the |

  38 | 19
  40 | 3366933336999999
  42 | 22222255555558880000023333333333336688888
  44 | 13333333338802244468
  46 | 11246024577
  48 | 54

> stem(ln, scale=2)

  The decimal point is 1 digit(s) to the left of the |

  39 | 19
  40 | 33669
  41 | 33336999999
  42 | 2222225555555888
  43 | 0000023333333333336688888
  44 | 133333333388
  45 | 02244468
  46 | 11246
  47 | 024577
  48 | 5
  49 | 4

Going from the original data to the square roots to the logarithms there is successively less skewness. There is still some skewness present in the logs and so we will try two stronger transformations: negative reciprocal roots and negative reciprocals. The reason for the negatives here is that when you take reciprocals you reverse up and down in the graph. This is a nuisance when comparing several graphs, so we introduce the minus sign to switch things back again. For most purposes, however, we do not do that. For example, if you measure fuel economy in miles per gallon and take the reciprocal you get gallons per mile -- a perfectly good, but different, way to measure fuel consumption. No one would normally put a negative sign in front of either measurement. However, we do have to mentally remember that high MPG is good while high GPM is bad, even though in some sense they measure the same thing.
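A two-line check makes the sign issue concrete: reciprocals reverse the order of the data, and the minus sign restores it.

```r
x <- c(50, 80, 140)   # increasing pulse rates
1 / x                 # now decreasing
-1 / x                # increasing again, so graphs keep their orientation
```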

You can use the techniques described above to get these transformations.

> nrr = -1/sqrt(PuAfter)
> stem(nrr)

  The decimal point is 2 digit(s) to the left of the |

  -14 | 1
  -13 | 64411
  -12 | 9777753333331111110000000
  -11 | 88866666555555555555533222220
  -10 | 999999999775443332100
   -9 | 987543322
   -8 | 85

> stem(nrr,scale=2)

  The decimal point is 2 digit(s) to the left of the |

  -14 | 1
  -13 | 6
  -13 | 4411
  -12 | 977775
  -12 | 3333331111110000000
  -11 | 888666665555555555555
  -11 | 33222220
  -10 | 999999999775
  -10 | 443332100
   -9 | 9875
   -9 | 43322
   -8 | 85

> nr = -1/PuAfter
> stem(nr)

  The decimal point is 3 digit(s) to the left of the |

  -20 | 0
  -18 | 5
  -16 | 992271111
  -14 | 62222227777773333333
  -12 | 99955555322222222222288555552
  -10 | 999999999441996664200
   -8 | 864197655
   -6 | 81

> stem(nr, scale=2)

  The decimal point is 3 digit(s) to the left of the |

  -20 | 0
  -19 | 
  -18 | 5
  -17 | 9922
  -16 | 71111
  -15 | 6222222
  -14 | 7777773333333
  -13 | 999555553222222222222
  -12 | 88555552
  -11 | 999999999441
  -10 | 996664200
   -9 | 8641
   -8 | 97655
   -7 | 81

Here four common transformations were applied in order of increasing strength for reducing skewness toward high values: square roots, logarithms (here to base e), negative reciprocal roots, and negative reciprocals. The last two transformations seem to have made the distribution most symmetric.
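One rough numeric check of symmetry is the gap between the mean and the median, scaled by the standard deviation; it is zero for perfectly symmetric data. A sketch with toy skewed data standing in for PuAfter:

```r
# Toy right-skewed data
x <- c(50, 60, 65, 70, 75, 80, 90, 110, 140)
for (f in list(identity, sqrt, log, function(v) -1 / sqrt(v), function(v) -1 / v)) {
  y <- f(x)
  cat((mean(y) - median(y)) / sd(y), "\n")   # closer to 0 means more symmetric
}
```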


©2008 Robert W. Hayden