This page uses the PULSE data. If you are not already familiar with it, you should read a description of the data. You can find the entire dataset at our site as a plain text file or as an Excel spreadsheet. Download the text version and save it in R's working directory. (If you are not sure which directory that is, type getwd() at the R prompt.) After that, run R and type

> PULSE = read.table("pulse.txt",header=TRUE)

at the R prompt. The argument `header=TRUE` tells R that the
first row of the file should be interpreted as variable names. (These
should not include spaces.) You can now get a table of contents for what
you have created in R with

> objects()

This should return `PULSE` along with any other objects you may
have created. You will not see on this list any of the variables that are
inside of `PULSE` because they are hiding. To see them, type

> names(PULSE)
[1] "PuBefor"   "PuAfter"   "Ran."      "Smokes."   "Sex"       "Height"
[7] "Weight"    "ActivityL"

To bring them out of hiding, you must "attach" them to your R workspace. (This avoids conflicts if several tables include variables with the same name. Attach just one at a time.)

> attach(PULSE)
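As an aside, here is a sketch (with a tiny made-up data frame, not the real PULSE data) of why you should attach only one table at a time, and of the `$` notation that avoids `attach()` altogether:

```r
# Two made-up tables that both contain a variable named Height
d1 <- data.frame(Height = c(60, 70))
d2 <- data.frame(Height = c(100, 200))

attach(d1)
mean(Height)        # finds d1's Height: 65
detach(d1)          # take d1 off the search path before attaching anything else

# The $ notation says exactly which table you mean, with no attach() needed
mean(d2$Height)     # 150
```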

You can get an assortment of summary statistics for all your variables
by using the `summary` command.

> summary(PULSE)
    PuBefor          PuAfter    Ran.     Smokes.      Sex         Height
 Min.   : 48.00   Min.   : 50   no :57   no :64   female:35   Min.   :61.00
 1st Qu.: 64.00   1st Qu.: 68   yes:35   yes:28   male  :57   1st Qu.:66.00
 Median : 71.00   Median : 76                                 Median :69.00
 Mean   : 72.87   Mean   : 80                                 Mean   :68.72
 3rd Qu.: 80.00   3rd Qu.: 85                                 3rd Qu.:72.00
 Max.   :100.00   Max.   :140                                 Max.   :75.00
     Weight        ActivityL
 Min.   : 95.0   Min.   :0.000
 1st Qu.:125.0   1st Qu.:2.000
 Median :145.0   Median :2.000
 Mean   :145.2   Mean   :2.109
 3rd Qu.:155.5   3rd Qu.:2.000
 Max.   :215.0   Max.   :3.000

Note that R reports different summaries depending on whether a column
contains numbers or text. This seems right except possibly for the
`ActivityL` variable, which is an ordered category and so might better be
treated as qualitative data. For now, just remember that anything in a
data file that looks like a number to R will be treated as one, and you
may need to take special steps if you have a column of numbers that are
actually labels -- medical diagnosis codes, for example.
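For example, one way (a sketch, using made-up codes) to tell R that a column of numbers is really a set of category labels is the `factor()` function; `ordered = TRUE` records that the categories have a natural order, as `ActivityL` does:

```r
# Made-up activity codes: 0 = none, 1 = light, 2 = moderate, 3 = a lot
codes <- c(0, 2, 2, 3, 1, 2)
act <- factor(codes, levels = 0:3, ordered = TRUE)
summary(act)    # counts in each category, not a meaningless mean
```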

The one summary you might miss from the above is the standard deviation. You can easily get that for the variable of your choice.

> sd(PuAfter)
[1] 17.09379

We can compare the before and after pulse rates with parallel boxplots.
R wants to see the data in a format that has all the measurements (both
before and after) in one column with another column labelling the two sets
of data. The R function `c` (for concatenate) combines a bunch of
things into a single thing.

> rate = c(PuBefor,PuAfter)
> rate
  [1]  48  54  54  58  58  58  60  60  60  60  61  62  62  62  62  62  62  62
 [19]  62  62  64  64  64  64  66  66  66  66  66  68  68  68  68  68  68  68
 [37]  68  68  68  68  70  70  70  70  70  70  72  72  72  72  72  72  74  74
 [55]  74  74  74  76  76  76  76  76  78  78  78  78  78  80  80  80  82  82
 [73]  82  84  84  84  84  86  87  88  88  88  90  90  90  90  92  92  94  96
 [91]  96 100  54  56  50  70  58  56  76  62  70  66  70  76  75  58 100  98
[109]  62  66  68  66  88  80  62  60  78  82 102  72  76  72  76  76 112  66
[127]  68  64  68  66  68  68  72 106  94  62  70  66  80  74  74  70  70  68
[145]  84  76  70  74  76 118  76  76  76  76 104 118  76  78  80  96 128  74
[163] 100  84  80  84  84  84  80  84  84 110  84  74  94  88  90  92  84  94
[181]  92 140 116 115

We used the name `rate` for the stacked variable. We also need a
variable to keep track of which group each measurement came from. The R
`rep` (for repeat) function is useful for generating repetitive
data.

> B = rep("Before",92)
> A = rep("After",92)
> BA = c(B,A)
> BA
  [1] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
  [9] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [17] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [25] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [33] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [41] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [49] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [57] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [65] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [73] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [81] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
 [89] "Before" "Before" "Before" "Before" "After"  "After"  "After"  "After"
 [97] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After"
[105] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After"
[113] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After"
[121] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After"
[129] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After"
[137] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After"
[145] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After"
[153] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After"
[161] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After"
[169] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After"
[177] "After"  "After"  "After"  "After"  "After"  "After"  "After"  "After"

This creates a variable `B` that is just 92 copies of the word
"Before", and similarly for `A`. The two are then concatenated into
`BA`, the categorical variable that keeps track of before and after.
Now you can type

> boxplot(rate ~ BA)

to get boxplots.
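One small wrinkle: R orders the groups of a character variable alphabetically, so "After" is plotted to the left of "Before". A sketch of one way to control the order, using a shortened made-up version of `BA` for illustration:

```r
# R sorts character groups alphabetically, so "After" plots before "Before".
# factor() with explicit levels restores time order (toy 3+3 version of BA):
BA <- c(rep("Before", 3), rep("After", 3))
levels(factor(BA))                               # "After"  "Before" (alphabetical)
BAf <- factor(BA, levels = c("Before", "After"))
levels(BAf)                                      # "Before" "After"
# boxplot(rate ~ BAf) would then show the "Before" box on the left
```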

The tilde "~" is used often in R. Here you can think of it as
saying we are going to see how `rate` depends on `BA`. You
can see that the "after" data show a higher median and more
variability along with high outliers not present in the "before"
data. The fact that both center and variability have changed makes it hard
to give a simple comparison here. If we looked only at the mean, we might
say that pulse rates went up by about seven points, and they did *on
average*. However, the lowest rates have not gone up much at all (the
minimum rose by only 2), while the highest changed considerably (the
maximum rose by 40!). A simple average change does not describe what
happened here very accurately, which is one reason we always need to *make
a picture*!
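Because the data are paired (the same 92 people measured before and after), we can also compute each person's change directly and summarize that. A sketch with four made-up pairs, chosen to echo the +2 and +40 extremes just described:

```r
# Four made-up before/after pairs (the real data have 92)
before <- c(62, 70, 80, 90)
after  <- c(64, 72, 96, 130)
change <- after - before    # one change per person
change                      # 2  2 16 40
mean(change)                # 15 -- the average hides how uneven the changes are
```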

Because the class contained both men and women, we might expect some bimodality in the heights and weights. Stem and leaf plots are a handy R tool for assessing the shape of a distribution. (Boxplots tend to hide bimodality.)

> stem(Height)

  The decimal point is at the |

  60 | 08
  62 | 000080000
  64 | 0000005
  66 | 000000000000000
  68 | 000000000000000000005
  70 | 0000000000005
  72 | 00000000000000055
  74 | 00000000

> stem(Weight)

  The decimal point is 1 digit(s) to the right of the |

   8 | 5
  10 | 288002556688
  12 | 000123555550000013555688
  14 | 000025555580000000000355555555557
  16 | 000045000055
  18 | 000500005
  20 | 5

Are the heights trimodal, or is that just random scatter? The weights look unimodal, but looking at our data on a single scale is rarely enough.

> stem(Weight, scale=2)

  The decimal point is 1 digit(s) to the right of the |

   9 | 5
  10 | 288
  11 | 002556688
  12 | 00012355555
  13 | 0000013555688
  14 | 00002555558
  15 | 0000000000355555555557
  16 | 000045
  17 | 000055
  18 | 0005
  19 | 00005
  20 |
  21 | 5

Now make boxplots as you did above but with Sex as the factor and Height as the measurement.

> boxplot(Height ~ Sex)

Both sexes show reasonably compact, symmetric distributions with no
outliers but the men are consistently about 5 inches taller than the
women. Because the two distributions have similar shapes and
variabilities, we can reasonably say that the men as a group are about 5
inches taller than the women. Compare this to the situation with the pulse
rates above, where such a simple description was an *over*simplification.
Note that although the two sexes are clearly different here, there is
enough overlap that the stem and leaf does not clearly show the two
groups.
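A numeric companion to the boxplot: `tapply()` applies a summary function within each group. A sketch with a few made-up heights (with the real data, after attaching PULSE, you would type `tapply(Height, Sex, median)`):

```r
# Made-up heights for three women and three men
ht  <- c(62, 63, 64, 68, 69, 70)
sex <- c("female", "female", "female", "male", "male", "male")
tapply(ht, sex, median)     # median height within each group: female 63, male 69
```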

We saw in the earlier boxplots of the pulse rates that the after rates were skewed toward high values with several possible outliers. When we see most of the alleged outliers on one side of a boxplot it may well be that we have a skewness problem rather than an outlier problem. Another sign is outliers that start close to the whiskers and gradually thin out. Another is a boxplot that looks skewed in the direction of the outliers even if we erase them. It is important to recognize these two different situations because the remedies are quite different.

If we have an outlier problem, we need to investigate the individual points to see if we can find a reason for their unusual behavior. If we have a skewness problem we might use the median rather than the mean to describe the center, or we might re-express (transform) the data to make it more symmetric.

Here are a number of common transformations and their effect on the after pulse readings. First, here is a stem and leaf of the original data.

  5* | 04
  5. | 6688
  6* | 022224
  6. | 666666888888
  7* | 000000022244444
  7. | 566666666666688
  8* | 000002444444444
  8. | 88
  9* | 022444
  9. | 68
 10* | 0024
 10. | 6
 11* | 02
 11. | 5688
 12* |
 12. | 8
 13* |
 13. |
 14* | 0

Let's see if a square root transformation makes this less skewed.

> SquareRoot = sqrt(PuAfter)
> stem(SquareRoot)

  The decimal point is at the |

   7 | 13
   7 | 556679999
   8 | 01111112222224444444
   8 | 5556666677777777777778899999
   9 | 122222222244
   9 | 56677789
  10 | 00123
  10 | 567899
  11 | 3
  11 | 8

This is better. Repeat with

> ln = log(PuAfter)

to get natural logarithms (base *e*).

> ln = log(PuAfter)
> stem(ln)

  The decimal point is 1 digit(s) to the left of the |

  38 | 19
  40 | 3366933336999999
  42 | 22222255555558880000023333333333336688888
  44 | 13333333338802244468
  46 | 11246024577
  48 | 54

> stem(ln, scale=2)

  The decimal point is 1 digit(s) to the left of the |

  39 | 19
  40 | 33669
  41 | 33336999999
  42 | 2222225555555888
  43 | 0000023333333333336688888
  44 | 133333333388
  45 | 02244468
  46 | 11246
  47 | 024577
  48 | 5
  49 | 4

Going from the original data to the square roots to the logarithms
there is successively less skewness. There is still some skewness present
in the logs and so we will try two stronger transformations: negative
reciprocal roots and negative reciprocals. The reason for the negatives
here is that when you take reciprocals you reverse up and down in the
graph. This is a nuisance when comparing several graphs, so we introduce
the minus sign to switch things back again. For most purposes, however, we
do *not* do that. For example, if you measure fuel economy in miles
per gallon and take the reciprocal you get gallons per mile -- a perfectly
good, but different, way to measure fuel consumption. No one would
normally put a negative sign in front of either measurement. However, we
*do* have to mentally remember that high MPG is good while high GPM
is bad, even though in some sense they measure the same thing.
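A quick sketch of the order reversal, with made-up fuel economy figures:

```r
mpg <- c(20, 30, 40)    # three cars, in increasing fuel economy
1/mpg                   # gallons per mile: the best car now has the smallest value
-1/mpg                  # the minus sign restores the original ordering
order(-1/mpg)           # 1 2 3, the same as order(mpg)
```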

You can use the techniques described above to get these transformations.

> nrr = -1/sqrt(PuAfter)
> stem(nrr)

  The decimal point is 2 digit(s) to the left of the |

  -14 | 1
  -13 | 64411
  -12 | 9777753333331111110000000
  -11 | 88866666555555555555533222220
  -10 | 999999999775443332100
   -9 | 987543322
   -8 | 85

> stem(nrr,scale=2)

  The decimal point is 2 digit(s) to the left of the |

  -14 | 1
  -13 | 6
  -13 | 4411
  -12 | 977775
  -12 | 3333331111110000000
  -11 | 888666665555555555555
  -11 | 33222220
  -10 | 999999999775
  -10 | 443332100
   -9 | 9875
   -9 | 43322
   -8 | 85

> nr = -1/PuAfter
> stem(nr)

  The decimal point is 3 digit(s) to the left of the |

  -20 | 0
  -18 | 5
  -16 | 992271111
  -14 | 62222227777773333333
  -12 | 99955555322222222222288555552
  -10 | 999999999441996664200
   -8 | 864197655
   -6 | 81

> stem(nr, scale=2)

  The decimal point is 3 digit(s) to the left of the |

  -20 | 0
  -19 |
  -18 | 5
  -17 | 9922
  -16 | 71111
  -15 | 6222222
  -14 | 7777773333333
  -13 | 999555553222222222222
  -12 | 88555552
  -11 | 999999999441
  -10 | 996664200
   -9 | 8641
   -8 | 97655
   -7 | 81

Here four common transformations were applied in order of increasing strength for reducing skewness toward high values: square roots, logarithms (here base *e*), negative reciprocal roots, and negative reciprocals. The last two transformations seem to have made the distribution most symmetric.
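If you want a number to go with the eyeball judgment, a simple moment-based skewness measure (the helper function below is our own, not built into base R) should shrink toward zero as the transformations do their work. A sketch on some made-up right-skewed values:

```r
# Hypothetical helper: standardized third moment; near 0 means roughly symmetric
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

x <- c(50, 60, 62, 70, 76, 80, 140)   # made-up values, skewed toward high values
skew(x)          # clearly positive
skew(sqrt(x))    # smaller
skew(log(x))     # smaller still
```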

©2008 Robert W. Hayden