Inference for One Proportion in R-Commander

These examples use the heart attack data, which come with this description of the variables:

Heart Attack Patients
This set of data is all of the hospital discharges in New York State 
with an admitting diagnosis of an Acute Myocardial Infarction (AMI), 
also called a heart attack, who did not have surgery, in the year 1993. 
There are 12,844 cases.

AGE gives age in years

SEX is coded M for males and F for females

DIAGNOSIS is in the form of an International Classification of Diseases, 
9th Edition, Clinical Modification code. These tell which part of the 
heart was affected.

DRG is the Diagnosis Related Group. It groups together patients with 
similar management. In this data set there are just three different DRGs:

121 for AMIs with cardiovascular complications who did not die.
122 for AMIs without cardiovascular complications who did not die.
123 for AMIs where the patient died.

LOS gives the hospital length of stay in days.

DIED has a 1 for patients who died in hospital and a 0 otherwise.

CHARGES gives the total hospital charges in dollars.

Data provided by Health Process Management of Doylestown, PA.

This is a very large data set and so is provided as zip files. (You may need a program such as WinZip to unzip them.) Plain text (with tabs separating entries) and Excel versions of the data are available.

R-Commander can read data from a text file. The text file has to be in the form of a table with columns representing variables. All columns must be the same length. Missing data must be signified by "NA". Optionally, the first row of the file may contain names for the variables. You can access such a file named heartatk4R.txt. Download and save this file in the directory where the R program lives.

To open the heart attack data in R-Commander, first start R, then type library(Rcmdr) into the session window. When R-Commander opens, select

Data > Import Data > From text file.

Type heartatk into the name window, click on the little circle for Tabs, and then click on the OK button. This will bring up a standard file selection window. Browse to wherever you saved the file heartatk4R.txt and open it. Select

Statistics > Summaries > Active Dataset

and you should see

    Patient        DIAGNOSIS     SEX           DRG             DIED       
 Min.   :    1   Min.   :41001   F:5065   Min.   :121.0   Min.   :0.0000  
 1st Qu.: 3212   1st Qu.:41041   M:7779   1st Qu.:121.0   1st Qu.:0.0000  
 Median : 6423   Median :41071            Median :122.0   Median :0.0000  
 Mean   : 6423   Mean   :41060            Mean   :121.7   Mean   :0.1098  
 3rd Qu.: 9633   3rd Qu.:41091            3rd Qu.:122.0   3rd Qu.:0.0000  
 Max.   :12844   Max.   :41091            Max.   :123.0   Max.   :1.0000  
                                                                          
    CHARGES           LOS              AGE        
 Min.   :    3   Min.   : 0.000   Min.   : 20.00  
 1st Qu.: 5422   1st Qu.: 4.000   1st Qu.: 57.00  
 Median : 8445   Median : 7.000   Median : 67.00  
 Mean   : 9879   Mean   : 7.569   Mean   : 66.29  
 3rd Qu.:12569   3rd Qu.:10.000   3rd Qu.: 77.00  
 Max.   :47910   Max.   :38.000   Max.   :103.00  
 NA's   :  699  

If, as in this case, there is text in the first row of the data file, R-Commander interprets it as variable names. If there is one fewer label than there are columns of data, the first column is interpreted as case identifiers. Here the first column is a case identifier, but since a label for it is present in the data file, the column is interpreted as a variable, and since the identifiers are numeric, they are treated as measurements. As a result, the summary statistics for "Patient" do not mean much.

R-Commander tries to guess whether each variable is quantitative or qualitative. DIAGNOSIS and DRG are medical categories coded as numbers. They are treated here as if the numbers contained quantitative information, which they do not. For our purposes, these misidentifications are not a problem. SEX and DIED are categorical variables. SEX is coded M-F, so it is recognized as categorical and the summary is a count of how many observations fell in each category. DIED is coded 0-1 and so is interpreted as numeric. Hence most of its summary statistics are useless, though of course the mean is just the proportion who died (10.98%).

The last three variables are correctly identified as numeric. It is interesting to note the extremes: CHARGES from $3 to $47,910, length of stay (LOS) from 0 to 38 days, and AGE from 20 to 103 years.
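For readers who prefer typed commands, the import and summary steps can be sketched in plain R. This is only an illustration under stated assumptions: the four data rows are made up for the example (they are not taken from heartatk4R.txt), the temporary file stands in for the real download, and the read.table() options shown are one reasonable choice, not necessarily the exact call R-Commander generates.

```r
# Hypothetical miniature of a tab-separated file with a header row.
# The values below are invented for illustration only.
sample_lines <- c(
  "Patient\tDIAGNOSIS\tSEX\tDRG\tDIED\tCHARGES\tLOS\tAGE",
  "1\t41041\tF\t122\t0\t4752\t10\t79",
  "2\t41041\tF\t122\t0\t3941\t6\t34",
  "3\t41091\tM\t123\t1\tNA\t2\t81",
  "4\t41071\tM\t121\t0\t8445\t7\t67"
)
path <- tempfile(fileext = ".txt")   # stand-in for the saved heartatk4R.txt
writeLines(sample_lines, path)

# header = TRUE: first row holds variable names
# sep = "\t":    entries are separated by tabs
# na.strings:    missing values are written as NA
# stringsAsFactors = TRUE: text columns such as SEX become categorical
heartatk <- read.table(path, header = TRUE, sep = "\t",
                       na.strings = "NA", stringsAsFactors = TRUE)

str(heartatk)      # SEX is a factor; DIED, DRG, DIAGNOSIS are read as numbers
summary(heartatk)  # counts for SEX, five-number summaries for the rest
mean(heartatk$DIED)  # for a 0-1 variable the mean is the proportion of 1s
```

With the real file, replacing path by the location of heartatk4R.txt should give a summary like the one shown above.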

It will be easier to work with the data on how many people died if we reeducate R-Commander to treat it as categorical. Actually, we will create a new categorical variable and keep the old numeric codes so we can have the best of both worlds. Select

Data > Manage variables in active dataset > Convert numeric variable to factor

Select the DIED variable and type in a new name so as to retain the old variable and create a new one. You could call the new variable DIEDTEXT. Click on OK and you will get a new dialog box listing the values found in the old variable and giving you spaces to fill in the new values. Type in No for 0 and Yes for 1. Click OK. Select

Statistics > Summaries > Frequency distributions

You should see DIEDTEXT listed among the categorical variables. Select it and click on OK. Among some R code you should see a table of counts

   No   Yes 
11434  1410 

and of percents.

      No      Yes 
89.02211 10.97789 
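The conversion to a factor and the two tables can also be sketched in plain R. To keep the example self-contained it rebuilds a 0-1 vector from the counts shown above instead of reading the data file; the object names are chosen for the illustration and the code is not necessarily what R-Commander itself writes.

```r
# Rebuild a 0-1 vector with the same counts as the DIED variable
DIED <- rep(c(0, 1), times = c(11434, 1410))

# New categorical variable; the numeric codes in DIED are left untouched
DIEDTEXT <- factor(DIED, levels = c(0, 1), labels = c("No", "Yes"))

counts <- table(DIEDTEXT)              # table of counts
counts

percents <- 100 * counts / sum(counts) # table of percents
round(percents, 5)

mean(DIED)  # proportion who died, computed from the numeric codes
```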

So 1410 of the 12,844 patients died. A single command gives a confidence interval and tests any hypothesized proportion p0 that you specify. Ignore the X-squared value and use the p-value for the hypothesis test. Select

Statistics > Proportions > Single sample proportion test

Select DIEDTEXT as your variable and change the null hypothesis value to 0.1. Leave the other defaults alone. Click on OK.

	1-sample proportions test without continuity correction

data:  rbind(.Table), null probability 0.1 
X-squared = 89115.87, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.1 
95 percent confidence interval:
 0.8846976 0.8955113 
sample estimates:
        p 
0.8902211 

Here R-Commander is telling us that we have to read computer output very carefully! From the sample proportion given of 0.8902211 we can see that we have a confidence interval for the proportion who did not die -- and have tested the hypothesis that 10% did not die. This is the price of automation. R-Commander made things "easy" for us by deciding which outcome we wanted a confidence interval for. Unfortunately, it guessed wrong.

There are two possible solutions. A mathematical approach is to redo the command the way the software wants, but entering 0.9 as the null value (90% did not die corresponds to 10% did die).

	1-sample proportions test without continuity correction

data:  rbind(.Table), null probability 0.9 
X-squared = 13.647, df = 1, p-value = 0.0002206
alternative hypothesis: true p is not equal to 0.9 
95 percent confidence interval:
 0.8846976 0.8955113 
sample estimates:
        p 
0.8902211 

Now we have the correct p-value for the hypothesis test (and reject the null because the p-value is so small). To fix the rest, subtract the given values from 1 to get a sample estimated proportion of those who died of 1-0.8902211 = 0.1097789, with a confidence interval going from 1-0.8955113 = 0.1044887 to 1-0.8846976 = 0.1153024.

The other option is to type in an R command to do it right. In the upper half of the R-Commander double window, move the cursor to a blank line at the bottom, type in

prop.test(1410,12844,p=0.1)

and click on the Submit button between the upper and lower subwindow. The general pattern for this command is

prop.test(number of outcomes of interest, total number of cases, p=proportion of outcomes of interest hypothesized)

where the numbers are taken from the summary tables above. The output in this case is

	1-sample proportions test with continuity correction

data:  1410 out of 12844, null probability 0.1 
X-squared = 13.5385, df = 1, p-value = 0.0002337
alternative hypothesis: true p is not equal to 0.1 
95 percent confidence interval:
 0.1044507 0.1153421 
sample estimates:
        p 
0.1097789 

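This is close to what we got doing the math. The small differences arise because this command applies a continuity correction by default, while the R-Commander dialog did not, as the first line of each output shows. What R-Commander did before we fixed things amounted to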

prop.test(11434,12844,p=0.1,correct=FALSE)

where the count of patients who did not die was used in place of the count of those who died.
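As a check on the two approaches, the following sketch uses only the counts from the frequency table. It assumes that turning off the continuity correction (correct = FALSE) mirrors what the dialog did, which is what the "without continuity correction" heading in its output suggests; the object names are made up for the illustration.

```r
deaths    <- 1410
survivors <- 11434
n         <- deaths + survivors   # 12844 cases in all

# The "mathematical approach": test the survivors against 0.9 ...
rc <- prop.test(survivors, n, p = 0.9, correct = FALSE)
rc$estimate                        # proportion who did not die
# ... then subtract from 1, swapping the ends of the interval
1 - rev(as.numeric(rc$conf.int))

# The direct approach, also without the continuity correction
direct <- prop.test(deaths, n, p = 0.1, correct = FALSE)
direct$estimate                    # proportion who died
as.numeric(direct$conf.int)
direct$p.value

# A third possibility: list "Yes" as the first level of the factor, so that
# a one-row table of counts puts the deaths in the "success" column
DIEDTEXT <- factor(rep(c("No", "Yes"), times = c(survivors, deaths)),
                   levels = c("Yes", "No"))
relevel_fit <- prop.test(rbind(table(DIEDTEXT)), p = 0.1, correct = FALSE)
relevel_fit$estimate
```

Without the continuity correction, the direct interval and the interval obtained by subtracting from 1 agree, and the two tests give the same p-value.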

R-Commander makes it easy to get a lot of things done, but sometimes the things it does so automatically are not what we want. In those cases it is good to have the full power of R available.


©2006-2007 Robert W. Hayden