These examples will use the heart attack data which comes with this description of its variables:
Heart Attack Patients

This set of data is all of the hospital discharges in New York State with an admitting diagnosis of an Acute Myocardial Infarction (AMI), also called a heart attack, who did not have surgery, in the year 1993. There are 12,844 cases.

AGE gives age in years.
SEX is coded M for males, F for females.
DIAGNOSIS is in the form of an International Classification of Diseases, 9th Edition, Clinical Modification code. These tell which part of the heart was affected.
DRG is the Diagnosis Related Group. It groups together patients with similar management. In this data set there are just three different DRGs:
    121 for AMIs with cardiovascular complications who did not die.
    122 for AMIs without cardiovascular complications who did not die.
    123 for AMIs where the patient died.
LOS gives the hospital length of stay in days.
DIED has a 1 for patients who died in hospital and a 0 otherwise.
CHARGES gives the total hospital charges in dollars.

Data provided by Health Process Management of Doylestown, PA.
This is a very large data set and so is provided as zip files. (You may need a program such as WinZip to unzip them.) Plain text (with tabs separating entries) and Excel versions of the data are available.
R-Commander can read data from a text file. The text file has to be in the form of a table with columns representing variables. All columns must be the same length. Missing data must be signified by "NA". Optionally, the first row of the file may contain names for the variables. You can access such a file named heartatk4R.txt. Download and save this file in the directory where the R program lives.

To open the heart attack data in R-Commander, first start R, then type library(Rcmdr) into the session window. When R-Commander opens, select
Data > Import Data > From text file.
Type heartatk into the name window, click on the little circle for Tabs, and then click on the OK button. This will bring up a standard file selection window. Browse to wherever you saved the file heartatk4R.txt and open it. Select
Statistics > Summaries > Active Dataset
and you should see
    Patient        DIAGNOSIS     SEX          DRG             DIED
 Min.   :    1   Min.   :41001   F:5065   Min.   :121.0   Min.   :0.0000
 1st Qu.: 3212   1st Qu.:41041   M:7779   1st Qu.:121.0   1st Qu.:0.0000
 Median : 6423   Median :41071            Median :122.0   Median :0.0000
 Mean   : 6423   Mean   :41060            Mean   :121.7   Mean   :0.1098
 3rd Qu.: 9633   3rd Qu.:41091            3rd Qu.:122.0   3rd Qu.:0.0000
 Max.   :12844   Max.   :41091            Max.   :123.0   Max.   :1.0000

    CHARGES           LOS              AGE
 Min.   :    3   Min.   : 0.000   Min.   : 20.00
 1st Qu.: 5422   1st Qu.: 4.000   1st Qu.: 57.00
 Median : 8445   Median : 7.000   Median : 67.00
 Mean   : 9879   Mean   : 7.569   Mean   : 66.29
 3rd Qu.:12569   3rd Qu.:10.000   3rd Qu.: 77.00
 Max.   :47910   Max.   :38.000   Max.   :103.00
 NA's   :  699
If, as in this case, there is text in the first row of the data file, R-Commander interprets this as variable names. If there is one fewer label than there are columns of data, the first column is interpreted as case identifiers. Here the first column is a case identifier, but since a label is present in the data file, the column is interpreted as a variable, and since the identifiers are numeric, they are treated as measurements. As a result, the summary statistics for "Patient" do not mean much.

R-Commander tries to guess whether each variable is quantitative or qualitative. DIAGNOSIS and DRG are medical diagnosis categories coded as numbers. They are treated here as if the numbers contained quantitative information, which they do not. For our purposes, these misidentifications are not a problem. SEX and DIED are categorical variables. SEX is coded M-F, so it is recognized as categorical and the summary is a count of how many observations fell in each category. DIED is coded 0-1 and so is interpreted as numbers. Hence most of its summary statistics are useless, though of course the mean is just the proportion who died (10.98%).

The last three variables are correctly identified as numeric. It is interesting to note the extremes: CHARGES from $3 to $47,910, Length Of Stay from 0 to 38 days, and AGE from 20 to 103 years old. CHARGES is also missing (NA) for 699 patients.
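For readers who want to see what the import and summary menus do underneath, here is a minimal sketch in plain R. It does not use the real file: it writes a tiny tab-separated stand-in (four made-up rows, not taken from the actual data) to a temporary file and reads that back. The name toy_file is hypothetical, and stringsAsFactors = TRUE is an assumption made so that SEX comes in as a factor as it does in the summary above. The exact command R-Commander generates may differ.

```r
# Tiny stand-in for heartatk4R.txt: tab-separated, header row, "NA" for missing.
# The four data rows are MADE UP for illustration only.
toy_lines <- c(
  "Patient\tDIAGNOSIS\tSEX\tDRG\tDIED\tCHARGES\tLOS\tAGE",
  "1\t41001\tF\t121\t0\t5000\t7\t70",
  "2\t41041\tM\t122\t0\t8000\t4\t55",
  "3\t41071\tM\t122\t0\tNA\t10\t62",
  "4\t41091\tF\t123\t1\t12000\t2\t81"
)
toy_file <- tempfile(fileext = ".txt")   # hypothetical temporary file
writeLines(toy_lines, toy_file)

# header = TRUE: first row holds variable names; sep = "\t": tab-separated.
heartatk <- read.table(toy_file, header = TRUE, sep = "\t",
                       na.strings = "NA", stringsAsFactors = TRUE)

# Numeric columns get quartiles and a mean; factor columns get counts.
print(summary(heartatk))

# How R classified each column: DIED and DRG are numeric codes, SEX is a factor.
print(sapply(heartatk, class))
```

With the real file saved locally, replacing toy_file with the path to heartatk4R.txt should give a data frame like the one summarized above.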
It will be easier to work with the data on how many people died if we reeducate R-Commander to treat it as categorical. Actually, we will create a new categorical variable and keep the old numeric codes so we can have the best of both worlds. Select
Data > Manage variables in active dataset > Convert numeric variable to factor
Select the DIED variable and type in a new name so as to retain the old variable and create a new one. You could call the new variable DIEDTEXT. Click on OK and you will get a new dialog box listing the values found in the old variable and giving you spaces to fill in the new values. Type in No for 0 and Yes for 1. Click OK. Select
Statistics > Summaries > Frequency distributions
You should see DIEDTEXT listed among the categorical variables. Select it and click on OK. In amongst some R code you should see a table of counts
   No   Yes
11434  1410
and of percents.
      No      Yes
89.02211 10.97789
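The conversion to a factor and the two tables can also be sketched directly in R. This version rebuilds the DIED column from the counts reported above rather than from the data file, so it runs on its own. It is an illustration of the idea, not necessarily the code R-Commander writes.

```r
# Rebuild the 0/1 column from the reported counts: 11434 zeros, 1410 ones.
DIED <- rep(c(0, 1), times = c(11434, 1410))

# Keep the numeric codes and add a labelled factor alongside them.
DIEDTEXT <- factor(DIED, levels = c(0, 1), labels = c("No", "Yes"))

counts <- table(DIEDTEXT)              # table of counts
percents <- 100 * prop.table(counts)   # table of percents

print(counts)
print(percents)

# The mean of the 0/1 codes is the proportion who died.
print(mean(DIED))
```

Keeping both DIED and DIEDTEXT is what gives the best of both worlds: the numeric codes still support mean(), while the factor gets counted by category.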
1410 of the patients died. A single command gives confidence intervals and tests any hypothetical p0 specified. Ignore the X-squared value and use the p-value for a hypothesis test. Select
Statistics > Proportions > Single sample proportion test
Select DIEDTEXT as your variable and change the null hypothesis value to 0.1. Leave the other defaults alone. Click on OK.
        1-sample proportions test without continuity correction

data:  rbind(.Table), null probability 0.1
X-squared = 89115.87, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.1
95 percent confidence interval:
 0.8846976 0.8955113
sample estimates:
        p
0.8902211
Here R-Commander is telling us that we have to read computer outputs very carefully! From the sample proportion given of 0.8902211 we can see that we have a confidence interval for the proportion who did not die -- and have tested the hypothesis that 10% did not die. This is the price of automation. R-Commander made things "easy" for us by deciding which outcome we wanted a confidence interval for. Unfortunately, it guessed wrong. There are two possible solutions. A mathematical approach is to redo the command the way the software wants, but entering 0.9 as the null value (90% did not die corresponds to 10% did die).
        1-sample proportions test without continuity correction

data:  rbind(.Table), null probability 0.9
X-squared = 13.647, df = 1, p-value = 0.0002206
alternative hypothesis: true p is not equal to 0.9
95 percent confidence interval:
 0.8846976 0.8955113
sample estimates:
        p
0.8902211
Now we have the correct p-value for the hypothesis test (and reject the null because the p-value is so small). To fix the rest, subtract the given values from 1 to get a sample estimated proportion of those who died of 1-0.8902211 = 0.1097789, with a confidence interval going from 1-0.8955113 = 0.1044887 to 1-0.8846976 = 0.1153024. The other option is to type in an R command to do it right. In the upper half of the R-Commander double window, move the cursor to a blank line at the bottom, type in
prop.test(1410, 12844, 0.1)
and click on the Submit button between the upper and lower subwindows. The general pattern for this command is
prop.test(number of outcomes of interest, total number of cases, proportion of outcomes of interest hypothesized)
where the numbers are taken from the summary tables above. The output in this case is
        1-sample proportions test with continuity correction

data:  1410 out of 12844, null probability 0.1
X-squared = 13.5385, df = 1, p-value = 0.0002337
alternative hypothesis: true p is not equal to 0.1
95 percent confidence interval:
 0.1044507 0.1153421
sample estimates:
        p
0.1097789
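This is close to what we got doing the math. The small differences arise because this command applies a continuity correction by default, while the R-Commander test did not. What R-Commander did before we fixed things amounted to
prop.test(11434, 12844, 0.1)
where the counts are reversed: the number who did not die is used in place of the number who died.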
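The relationship between the three analyses can be checked with a short sketch in R. It assumes, based on the "rbind(.Table)" line in the output above, that R-Commander passes a one-row table of counts with "No" in the first column. When prop.test is given a two-column matrix, it takes the first column as successes and the second as failures. The variable names are only illustrative, and correct = FALSE is set throughout so the results line up with the uncorrected R-Commander output.

```r
died <- 1410
survived <- 11434
n <- died + survived   # 12844 cases in all

# Roughly what the menu did: "No" comes first, so "No" is the success,
# and the test is about the proportion who did NOT die.
rc_style <- prop.test(rbind(c(No = survived, Yes = died)),
                      p = 0.1, correct = FALSE)
print(rc_style$estimate)

# The mathematical fix: test "did not die" against 0.9 ...
flipped <- prop.test(survived, n, p = 0.9, correct = FALSE)
# ... then subtract the interval from 1 (rev() restores lower-upper order).
ci_from_math <- 1 - rev(as.numeric(flipped$conf.int))

# The direct fix: test "died" against 0.1.
direct <- prop.test(died, n, p = 0.1, correct = FALSE)

print(ci_from_math)
print(as.numeric(direct$conf.int))
print(c(flipped = flipped$p.value, direct = direct$p.value))
```

Without the continuity correction, the two fixes should agree on both the interval and the p-value, which suggests the differences seen above come from the correction alone.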
R-Commander makes it easy to get a lot of things done, but sometimes the things it does so automatically are not what we want. In those cases it is good to have the full power of R available.
©2006-2007 Robert W. Hayden