R can do all the usual chi-square tests, with either raw data or tables of counts. We start with the heart attack data and the Goodness-of-Fit test.

> names(heartatk) [1] "Patient" "DIAGNOSIS" "SEX" "DRG" "DIED" "CHARGES" [7] "LOS" "AGE" > attach(heartatk) > table(SEX) SEX F M 5065 7779 > chisq.test(table(SEX)) Chi-squared test for given probabilities data: table(SEX) X-squared = 573.4815, df = 1, p-value < 2.2e-16

The command `chisq.test(table(SEX))` does a chi-square goodness
of fit test on the table for the `SEX` variable. The default is to
test for equal expected counts in every cell. That is hardly the case
here. 2.2e-16 means 2.2X10^{-16} = 0.00000000000000022, which is
small.

If you do not want equal proportions, you need to give a set of *proportions*
for each cell. For example, genetic theory predicts that certain fruit
flies will fall into four categories in proportions 9:3:3:1. Data showed
counts of 59, 20, 11 and 10.

> chisq.test(c(59,20,11,10),p=c(9/16,3/16,3/16,1/16)) Chi-squared test for given probabilities data: c(59, 20, 11, 10) X-squared = 5.6711, df = 3, p-value = 0.1288

We would not reject the theoretical hypothesis with these data.

You can make a cross-tabulation of two categorical variables with `table`
and do a test of independence or homogeneity with `chisq.test`.
(We return to the heart attack data.)

> table(SEX,DIED) DIED SEX 0 1 F 4298 767 M 7136 643 > chisq.test(table(SEX,DIED)) Pearson's Chi-squared test with Yates' continuity correction data: table(SEX, DIED) X-squared = 147.7612, df = 1, p-value < 2.2e-16

We have seen this *p*-value before! It is probably the smallest
non-zero number R can handle and hence not very accurate. However, *p*
is definitely small! Hence we reject the hypothesis that the mortality
rate is the same for men and women. Looking at the data, it is higher for
men.

With 12,844 observations, *getting* the table is a lot more work
than computing chi-square, and it is best to let the computer do it. If
you have an existing table, R will analyze it -- but not without putting
up a fight. You need to enter the table one row at a time and use `rbind`
to combine the rows into a table. Here is some data on hepatitis C
incidence (yes,no) and tattoo "status": from a parlor,
elsewhere, and no tattoo.

> hep=rbind(c(17,35),c(8,53),c(22,491)) > hep [,1] [,2] [1,] 17 35 [2,] 8 53 [3,] 22 491 > chisq.test(hep) Pearson's Chi-squared test data: hep X-squared = 57.9122, df = 2, p-value = 2.658e-13 Warning message: Chi-squared approximation may be incorrect in: chisq.test(hep)

The warning message probably refers to the small expected count the cell where the 8 appears. You could overcome that by pooling the two tattoo sources together. Either way, getting tattoos seems to greatly increase the risk of hepatitis C.

© 2006 Robert W. Hayden