Inference for Two Proportions in R

Do "old" people snore more than "young" people? A random sample of 995 people was surveyed, and each person was categorized as "young" (184 people under 30) or "old" (811 people over 30).

> prop.test(x=c(318,48), n=c(811,184), conf.level=0.95)

        2-sample test for equality of proportions with continuity correction

data:  c(318, 48) out of c(811, 184) 
X-squared = 10.5513, df = 1, p-value = 0.001161
alternative hypothesis: two.sided 
95 percent confidence interval:
 0.05610974 0.20636815 
sample estimates:
   prop 1    prop 2 
0.3921085 0.2608696 

The x counts are for what we might call "success." Here success is snoring; 318 old people and 48 young people did it. We are comparing two age groups, so the sizes of the age groups go in as n. The order in which the counts are entered is the same as the order of subtraction for computing the difference, so the confidence interval is for the difference old-young. Note that you get both a hypothesis test and a confidence interval. The results may not agree exactly with what you get using textbook formulae because R applies a "continuity correction" by default, a tweak that is not worth the effort if you are doing things by hand.
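
To see what the continuity correction changes, here is a minimal sketch that computes the usual textbook interval for the difference by hand and compares it with prop.test when the correction is switched off through its correct argument. The variable names (p_hat, se, ci_hand, ci_r) are ours, chosen only for illustration.

```r
# Snoring counts from the example above: old first, then young
x <- c(318, 48)    # number who snore in each group
n <- c(811, 184)   # size of each group

p_hat <- x / n                              # sample proportions
se    <- sqrt(sum(p_hat * (1 - p_hat) / n)) # standard error of the difference
z     <- qnorm(0.975)                       # critical value for 95% confidence

# Textbook interval for (old - young), with no continuity correction
ci_hand <- (p_hat[1] - p_hat[2]) + c(-1, 1) * z * se

# The same interval from prop.test with the correction turned off
ci_r <- as.numeric(prop.test(x = x, n = n, correct = FALSE)$conf.int)

print(ci_hand)
print(ci_r)
```

With correct=FALSE the two intervals should match. Leaving the correction on (the default) widens the interval slightly, which is why the output above differs a little from a hand calculation.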

If you do not have the counts but do have the raw data in variables in R, you can use R's table command to get the counts. This also catches some types of gross error, such as an 11 in a column that is supposed to contain only 0s and 1s. We will use the heart attack data as an example. You have to reattach the data (with attach) each time you start R; otherwise R cannot find the variables.

> table(SEX,DIED)
Error in table(SEX, DIED) : object "SEX" not found
> attach(heartatk)
> table(SEX,DIED)
SEX    0    1
  F 4298  767
  M 7136  643

R is being mean and not returning the row and column totals. We can get those with repeated use of table and length.

> table(SEX)
   F    M 
5065 7779 
> table(DIED)
    0     1 
11434  1410 
> length(SEX)
[1] 12844

It would make a good exercise to put these totals in their proper place in the original table. Page layout is no guide to what is statistically correct!
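
If you want to check your answer to that exercise, R's addmargins function will append the totals to a table. The sketch below is self-contained: rather than assuming the heart attack file is loaded, it rebuilds a stand-in data frame (which we call heart, a name of our own) from the four counts printed above.

```r
# Stand-in for the heart attack data, rebuilt from the counts shown above
counts <- c(4298, 767, 7136, 643)
heart <- data.frame(
  SEX  = rep(c("F", "F", "M", "M"), times = counts),
  DIED = rep(c(0, 1, 0, 1), times = counts)
)

# Two-way table of counts, labeled the same way as the output above
tab <- table(heart$SEX, heart$DIED, dnn = c("SEX", "DIED"))

# Append row and column totals (labeled "Sum") to the table
with_totals <- addmargins(tab)
print(with_totals)
```

The row totals are the group sizes you will need for n, and the bottom-right corner is the overall sample size.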

We will compare the mortality rates of males and females. This amounts to labeling death as "success". We need the numbers who died in each group for x and the total number of people in each group (total males and total females) for n. Make sure you enter these in consistent order. We are very old and followed the old-fashioned rule of "ladies first" -- for both x and n.

> prop.test(x=c(767,643),n=c(5065,7779))

        2-sample test for equality of proportions with continuity correction

data:  c(767, 643) out of c(5065, 7779) 
X-squared = 147.7612, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided 
95 percent confidence interval:
 0.05699518 0.08055073 
sample estimates:
    prop 1     prop 2 
0.15143139 0.08265844 

"Ladies first" means we subtracted F-M, so the positive numbers in the confidence interval mean the mortality rate was higher for women. The fact that the interval does not include zero means that the difference is unlikely to be due to sampling error, and the tiny p-value confirms this. (Bear in mind that this is not a random sample, so we need to be very cautious in extrapolating to other states or years.)
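
Typing counts by hand invites ordering mistakes. As one possible safeguard, the sketch below pulls x and n out of the two-way table so that both are guaranteed to be in the same F-then-M order. The matrix tab is typed in from the table output above so the example runs on its own; the names tab and result are ours.

```r
# Counts copied from the table(SEX, DIED) output above
tab <- matrix(c(4298, 767,
                7136, 643),
              nrow = 2, byrow = TRUE,
              dimnames = list(SEX = c("F", "M"), DIED = c("0", "1")))

x <- tab[, "1"]     # deaths ("successes") in each group, F then M
n <- rowSums(tab)   # group sizes, in the same F then M order

result <- prop.test(x = x, n = n)
print(result$estimate)   # the two sample mortality rates
print(result$conf.int)   # interval for the difference F - M
```

Because x and n come from the same rows of the same table, the "consistent order" requirement takes care of itself.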

© 2006 Robert W. Hayden