This page uses the PULSE data. If you are not already familiar with it, you should read a description of the data. You can find the entire dataset at our site as a plain text file or as an Excel spreadsheet. Download the text version and save it in the directory where you installed R. After that, run R and type

    > PULSE = read.table("pulse.txt",header=TRUE)
    > attach(PULSE)
    > names(PULSE)
    [1] "PuBefor"   "PuAfter"   "Ran."      "Smokes."   "Sex"       "Height"
    [7] "Weight"    "ActivityL"

You can get an assortment of summary statistics for all your
categorical variables by using the `summary` command. The
variables R recognizes as categorical (because they are text) are Ran.,
Smokes. and Sex. (The periods are there because the source file has
question marks in those locations and question marks are illegal in R
variable names.)
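
The renaming can be checked without the data file. The sketch below assumes the original column headers were `Ran?` and `Smokes?` (the text above says only that they contained question marks), and the three data rows are made up purely for illustration. It also passes `stringsAsFactors = TRUE` explicitly, because R 4.0.0 and later no longer convert text columns to factors by default, and `summary` reports counts only for factors.

```r
# make.names() is the function read.table() applies to header fields
# (when check.names = TRUE, its default) to get legal variable names.
legal <- make.names(c("Ran?", "Smokes?", "Sex"))
print(legal)

# Hypothetical three-row file, supplied inline through the text= argument.
toy <- read.table(text = "Ran? Sex
yes male
no female
no male", header = TRUE, stringsAsFactors = TRUE)

print(names(toy))         # the header "Ran?" has become "Ran."
print(summary(toy$Ran.))  # counts per category, because Ran. is a factor
```

If `summary(Ran.)` reports a length and a mode instead of counts, the column was probably read as plain text rather than as a factor.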

    > summary(Ran.)
     no yes
     57  35
    > summary(Smokes.)
     no yes
     64  28
    > summary(Sex)
    female   male
        35     57

A table is usually the best summary for categorical data. Once we have a table we should look at it and say something sensible. A very short verbal summary here is that most of these students are men, most of them did not run, and most of them do not smoke. We should also note anything unusual or unexpected in the data. Here we might wonder about the imbalance between the sexes. At least in the United States, college students have been roughly evenly balanced between the sexes for decades. Why such a preponderance of males here? Is this a course for engineering majors? Were the data gathered many years ago, or in a country where fewer women go to college? These are the kinds of things a good analyst looks for and questions. We might have similar questions about how few ran. Was this really decided by a fair coin toss, or did we have some non-compliance for this "treatment"?

Now let's look at the relationship between the two categorical variables Sex and Smokes.

    > table(Sex, Smokes.)
            Smokes.
    Sex      no yes
      female 27   8
      male   37  20

Such tables usually include row and column totals. To get those we use a powerful R idea: save the result of one procedure and pass it to another.

    > tab = table(Sex, Smokes.)
    > addmargins(tab)
            Smokes.
    Sex      no yes Sum
      female 27   8  35
      male   37  20  57
      Sum    64  28  92
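
If you do not have the data file at hand, the same table can be typed in from the four counts shown above. This is only a sketch: building the table with `matrix` and `as.table` is our shortcut here, not how the table was made from the PULSE data.

```r
# Rebuild the Sex-by-Smokes. table from the four published counts.
# matrix() fills column by column: first the "no" column, then "yes".
tab <- as.table(matrix(c(27, 37, 8, 20), nrow = 2,
                       dimnames = list(Sex = c("female", "male"),
                                       Smokes. = c("no", "yes"))))

# Save the result, then pass it on to a second procedure.
with_totals <- addmargins(tab)
print(with_totals)
```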

Of course there is no point in getting this table unless we can interpret it. One thing we might be interested in is whether there is a difference in the prevalence of smoking between the two sexes. 8 out of 35 females smoke while 20 out of 57 males smoke. Those are hard to compare unless we change to a common denominator, or express them as proportions or percents. We suggest you grab your trusty calculator (or maybe there is one built into your computer). We see that 8 out of 35 or about 23% of the females smoke and 20 out of 57 or about 35% of the males, so smoking is more common among males in this group of students. R can do the arithmetic for you.

    > prop.table(tab,1)
            Smokes.
    Sex             no       yes
      female 0.7714286 0.2285714
      male   0.6491228 0.3508772

The "1" tells R to compute proportions within each row, so each sex adds up to 1 and we can compare the sexes. To compare smokers to non-smokers, compute column percents.

    > prop.table(tab,2)
            Smokes.
    Sex             no       yes
      female 0.4218750 0.2857143
      male   0.5781250 0.7142857
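
To see what the 1 and the 2 are doing, it may help to do the same division by hand. The sketch below rebuilds the table from its counts (an assumption made so the code runs without the data file) and checks that `prop.table` simply divides each cell by its row total or its column total.

```r
# Counts typed in from the table above; not read from pulse.txt.
tab <- as.table(matrix(c(27, 37, 8, 20), nrow = 2,
                       dimnames = list(Sex = c("female", "male"),
                                       Smokes. = c("no", "yes"))))

row_props <- prop.table(tab, 1)   # each row sums to 1
col_props <- prop.table(tab, 2)   # each column sums to 1

# The same thing by hand: divide by row totals, then by column totals.
by_hand_rows <- tab / rowSums(tab)
by_hand_cols <- sweep(tab, 2, colSums(tab), "/")

print(all.equal(as.vector(row_props), as.vector(by_hand_rows)))
print(all.equal(as.vector(col_props), as.vector(by_hand_cols)))
```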

Note that in R you can use the up-arrow key to recall previous commands, so the latest command could be created by using the up-arrow once and changing the 1 to a 2.

Now let's look at a very trivial issue that we discuss only because it often leads to confusion for beginners. Depending on how we select our variables in a two-way table, we can get different looking tables.

    > table(Smokes., Sex)
           Sex
    Smokes. female male
        no      27   37
        yes      8   20

Here the rows and columns are interchanged compared to our original
table. *There is no right (or wrong) way to do this*! Often
the choice is determined by non-statistical issues like fitting the table
on a page or overhead. The only reason it is worth mentioning is to warn
you not to memorize any rules for working with tables that include the
words "row" or "column", since the same information
could be in either a row or a column, depending on how the table is laid
out.
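
A small check makes the point concrete. In this sketch the table is rebuilt from its counts and flipped with `t()` (our stand-in for calling `table(Smokes., Sex)` on the real data). The proportion of females who smoke is a row percent in one layout and a column percent in the other.

```r
# Counts typed in from the tables above; not read from pulse.txt.
tab <- as.table(matrix(c(27, 37, 8, 20), nrow = 2,
                       dimnames = list(Sex = c("female", "male"),
                                       Smokes. = c("no", "yes"))))
flipped <- t(tab)   # rows are now Smokes., columns are Sex

# Same question, "what proportion of females smoke?", in both layouts.
from_rows <- prop.table(tab, 1)["female", "yes"]      # row percents
from_cols <- prop.table(flipped, 2)["yes", "female"]  # column percents
print(c(from_rows, from_cols))
```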

In looking at a table we can think in terms of counts or of proportions, such as 28 out of 92 smoke. We can also think of the latter as a probability. If we pick a person at random from this group, the probability that they smoke is 28 out of 92. In some cases, this is all we want to know. In other cases, this might be an estimate of some other probability or proportion -- perhaps we have a sample value and want to look at a larger population. In what follows, we will talk mainly in terms of probabilities. We will also try to match up what we can get from the table with probability terminology and notation.

From any of our tables, we can see that the probability that a person
selected at random smokes is 28/92 = 0.3 or 30%. The probability that they
are male is 57 out of 92 or 62%. *Simple probabilities come from the
Sum row and column.* The probability that a person does not smoke
can be found as 64/92 or by the complement rule as 1-(the probability that
they smoke) or 1-(28/92). Both approaches should give 70%. (In theoretical
work and in doing arithmetic we usually use the proportion 0.7 but when
interpreting results most people prefer percentages.) It is important to
recognize how these rules play out in tables, because this sort of data is
almost always presented in tables!
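
Here is how those simple probabilities and the complement rule might be computed in R. This is a sketch that rebuilds the table from its counts; with the real data you would use the `tab` created earlier.

```r
# Counts typed in from the tables above; not read from pulse.txt.
tab <- as.table(matrix(c(27, 37, 8, 20), nrow = 2,
                       dimnames = list(Sex = c("female", "male"),
                                       Smokes. = c("no", "yes"))))
n <- sum(tab)                          # 92 people in all

p_smokes <- margin.table(tab, 2) / n   # from the Sum row: no, yes
p_sex    <- margin.table(tab, 1) / n   # from the Sum column: female, male

# Two routes to P(does not smoke): count directly, or use the complement.
direct     <- p_smokes[["no"]]         # 64/92
complement <- 1 - p_smokes[["yes"]]    # 1 - 28/92
print(round(100 * c(direct, complement), 1))
```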

There is also a rule for probabilities with "and", but it works
only for independent events. These are important in theory but rare in
practice ;-) In practice we have to count. From the table, we can see that
there were 8 people who were female AND smoked. Hence the correct
probability for this is 8/92=8.7%. The independent event formula would
multiply the two simple probabilities, P(smokes) and P(female), to
give (28/92)*(35/92)=11.6% -- close but not really close. *Probabilities
with "and" are generally found with a total percents table.*

    > prop.table(tab)
            Smokes.
    Sex              no        yes
      female 0.29347826 0.08695652
      male   0.40217391 0.21739130
    > addmargins(prop.table(tab))
            Smokes.
    Sex              no        yes        Sum
      female 0.29347826 0.08695652 0.38043478
      male   0.40217391 0.21739130 0.61956522
      Sum    0.69565217 0.30434783 1.00000000

The probability of being a female smoker (which we calculated a moment ago) is the entry in the female row and yes column, 0.08695652, of the table immediately above. The probability of being a male (and a) non-smoker is about 40.2%.
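
The comparison between counting and the independent event formula can be checked directly. The sketch below rebuilds the table from its counts (so it runs without the data file) and computes both numbers for "female AND smokes".

```r
# Counts typed in from the tables above; not read from pulse.txt.
tab <- as.table(matrix(c(27, 37, 8, 20), nrow = 2,
                       dimnames = list(Sex = c("female", "male"),
                                       Smokes. = c("no", "yes"))))
n <- sum(tab)

# Counting: the cell of the total percents table.
p_counted <- prop.table(tab)[["female", "yes"]]   # 8/92

# The independent event formula: product of the two simple probabilities.
p_female <- sum(tab["female", ]) / n              # 35/92
p_smokes <- sum(tab[, "yes"]) / n                 # 28/92
p_if_independent <- p_female * p_smokes

print(round(100 * c(counted = p_counted, formula = p_if_independent), 1))
```

The two values differ, which is just another way of saying that sex and smoking are not independent in these data.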

Disjoint (also called "mutually exclusive") events connect with tables in two ways. First, when you set up each categorical variable, the categories should be disjoint. People should have just one activity level, and either they ran or they did not. If you open a well-constructed data file, this should already be taken care of. You may have to be more careful if you set up a data file yourself. For example, you may have a survey question that asks people to check a list of hobbies they have. Since people may have more than one hobby, your hobbies may not form disjoint sets. The standard way to deal with this is to represent each hobby choice with a yes-no variable.
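
As a sketch of the yes-no approach, suppose four people checked hobbies from a list of three. The people, the hobby names and the variable names below are all hypothetical, invented only to show the layout. Each hobby becomes its own variable, and within each variable the categories "no" and "yes" are disjoint.

```r
# Hypothetical survey answers: each person may check several hobbies or none.
responses <- list(c("reading", "running"),
                  "chess",
                  c("running", "chess", "reading"),
                  character(0))
hobbies <- c("reading", "running", "chess")

# One logical column per hobby: did this person check it?
checked <- sapply(hobbies, function(h)
  sapply(responses, function(r) h %in% r))

# Store each column as a yes-no categorical variable.
indicators <- as.data.frame(checked)
indicators[] <- lapply(indicators, function(x)
  factor(ifelse(x, "yes", "no"), levels = c("no", "yes")))

print(indicators)
print(table(indicators$running))
```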

We may also see disjointness between certain values of *different*
variables. For example, if we are studying the prevalence of various forms
of cancer and comparing males and females, we will find no males with
ovarian cancer and no females with prostate cancer. These are two examples
of disjoint events and we would see two 0's in the contingency table. On
the other hand, when we see 0's, we always wonder whether there is some
reason (biological in our example) why the events are disjoint, or whether
the 0 is just a peculiarity of this set of observations.

Conditional probabilities are computed in row percent and column percent
tables. In fact, the meaning of conditional probabilities is *much*
clearer in tables than it is in language or mathematical notation. The
idea of a conditional probability is that you are looking at a subset of
the data. For example, in an election poll we might be interested in the
proportion of voters who prefer Candidate A, and also be interested in what
that proportion is among certain subsets, such as men, women or blacks.
For the PULSE data, we saw that about 30% of the 92 people smoked.
However, for the subgroup of females, only 8 out of 35 or about 23% smoke.
Often we want to compare one subset to another. Here 20/57 males or about
35% smoke. We noted this earlier and found those numbers in the table. The
notation for these conditional probabilities might look something like
P(smokes | female) and P(smokes | male) respectively. These are row
percents because probabilities are computed with the row totals as
denominators. The subgroups are males and females.

    > addmargins(prop.table(tab,1))
            Smokes.
    Sex             no       yes       Sum
      female 0.7714286 0.2285714 1.0000000
      male   0.6491228 0.3508772 1.0000000
      Sum    1.4205514 0.5794486 2.0000000

(Don't take that last row too seriously.) We can also compare smokers to non-smokers.

    > addmargins(prop.table(tab,2))
            Smokes.
    Sex             no       yes       Sum
      female 0.4218750 0.2857143 0.7075893
      male   0.5781250 0.7142857 1.2924107
      Sum    1.0000000 1.0000000 2.0000000

(Don't take that last column too seriously.) 71% of the smokers were male. The notation for this conditional probability might look something like P(male | smokes). It's not the same as P(smokes | male)=35%. Now the subgroups are smokers and non-smokers. Recall that what is a row and what is a column is arbitrary, so in practice you have to ask yourself, "Do I want to compare males to females or smokers to non-smokers?" and not "Do I want row percents or column percents?" Putting these into words may help to see the difference and how these arise in practice. P(male | smokes) is about the 28 people who smoke. Of those 28, what proportion were male?
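
The two conditional probabilities share a numerator and differ only in the subgroup used as the denominator. A short sketch, again rebuilding the table from its counts so that it runs on its own:

```r
# Counts typed in from the tables above; not read from pulse.txt.
tab <- as.table(matrix(c(27, 37, 8, 20), nrow = 2,
                       dimnames = list(Sex = c("female", "male"),
                                       Smokes. = c("no", "yes"))))

male_smokers <- tab[["male", "yes"]]   # 20 people are male AND smoke

# P(smokes | male): the subgroup is the 57 males.
p_smokes_given_male <- male_smokers / sum(tab["male", ])

# P(male | smokes): the subgroup is the 28 smokers.
p_male_given_smokes <- male_smokers / sum(tab[, "yes"])

print(round(100 * c(p_smokes_given_male, p_male_given_smokes)))
```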

Independence is closely related to conditional probabilities. If gender and smoking were independent, then a column percents table might look like this:

            no   yes  Rowtotal
    female  0.35 0.35 0.35
    male    0.65 0.65 0.65

with the percent of females the same for smokers and non-smokers and for
the group as a whole. "Independence" can be a tricky word in
ordinary English, and is even more so in statistics. Independence in the
table above means that the proportion of females is the same for both
smokers and non-smokers. But smoking and gender are *de*pendent in
the sense that if I know the percentage of smokers who are female, and I
know the two are independent, then I know the percentage of non-smokers
who are female. Ironically, statistical independence puts very tight
restrictions on what a two-way table can look like. Rarely do we see
complete independence in real data and often the question is how close we
come to independence. Here percentages of females between smokers and
non-smokers are in the ballpark but not really close (28.6% versus 42.2%).
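
One way to see how tight that restriction is: once the row and column totals are fixed, independence determines every cell. The sketch below rebuilds the table from its counts and computes the counts we would need for exact independence as (row total × column total) / n. For these totals the common female share works out to 35/92, about 0.38, rather than the round 0.35 used in the illustration above.

```r
# Counts typed in from the tables above; not read from pulse.txt.
tab <- as.table(matrix(c(27, 37, 8, 20), nrow = 2,
                       dimnames = list(Sex = c("female", "male"),
                                       Smokes. = c("no", "yes"))))
n <- sum(tab)

# Counts that exact independence would require, given the same totals.
expected <- outer(rowSums(tab), colSums(tab)) / n

# Column percents: identical columns under independence ...
print(round(prop.table(expected, 2), 3))
# ... versus what the data actually show.
print(round(prop.table(tab, 2), 3))
```

The expected counts need not be whole numbers, which is one more reason we rarely see complete independence in a real table.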

© 2008 Robert W. Hayden